Haplogroups, Haplotypes and genealogy, oh my
I received some criticism the other day over my own criticism of the use of haplogroups in genealogy -- the finding and tracing of relatives. My language was imprecise so I want to make a correction and explore the issue in a bit more detail.
One of the most basic facts of inheritance is that while most of your DNA is a mishmash of your parents' DNA (and that of all their ancestors before them), two pieces of DNA are passed down almost unchanged. One is the mitochondrial DNA, which is passed down from the mother to all her children. The other is the Y chromosome, which is passed down directly from father to son. Girls don't get one. A father's X chromosome is also passed down almost unchanged to his daughters (but not his sons), but of course they can't pass it unchanged to anybody.
This allows us to track the ancestry of two lines. The maternal line tracks your mother, her mother, her mother, her mother and so on. The paternal line tracks your father, his father and so on. The paternal line should, in theory, match the surname, but for various reasons it sometimes doesn't. Females don't have a Y, but they can often find out what Y their father had if they can sequence a sample from him, his sons, his brothers or other male relatives who share his surname.
The ability to do this got people very excited. DNA that can be tracked back arbitrarily far in time has become very useful for the study of human migrations and population genetics. The DNA is normally passed down intact, but every so often there is a mutation. These mutations, if they don't kill you, are passed down. The various collections of mutations are formed into a tree, and the branches of the tree are known as haplogroups. For both kinds of DNA, a couple of hundred haplogroups are commonly identified. Many DNA testing companies will look at your DNA and tell you your mtDNA haplogroup and, if you are male, your Y haplogroup.
There is another way to type Y DNA, however, known as short tandem repeat (STR) testing. This test relies on changes in the so-called "junk DNA" of repeated sequences. The SNP (single-nucleotide polymorphism) mutations that drive haplogroups are rare, but STR changes are more common. Your list of STRs is called a haplotype and is much more specific.
Services like 23andMe give you your haplogroups. You will share a haplogroup with your siblings, one of your parents, one of your grandparents, one of your great-grandparents and so on. You will also share it, by descent, with some fraction of all those people's descendants. (With two haplogroups, maternal and paternal, you share one with each of 2 ancestors in every generation.)
Your family tree is very bushy. Go back 10 generations, and you have 2^10 or 1024 ancestors at that level. Most people actually have slightly fewer because we all start duplicating ancestors before that point. Go back 100 generations (2,000 years) and you would have a million trillion trillion ancestors, if not for the fact that each one re-appears billions of trillions of times in your family tree. Yes, more than a Sagan. However, the maternal and paternal lines identify just 2 different ancestors out of that huge tree, the one on the far left and the one on the far right.
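The doubling arithmetic above is easy to check. Here is a minimal sketch; the function name theoretical_ancestors is just an illustrative label, and it counts slots in the tree, not distinct people, since real ancestors repeat.

```python
def theoretical_ancestors(generations: int) -> int:
    """Number of ancestor slots at a given generation, assuming no duplicates."""
    return 2 ** generations

for generations in (10, 100):
    count = theoretical_ancestors(generations)
    # Scientific notation makes the 100-generation figure readable.
    print(f"{generations} generations back: {count:,} slots (~{count:.2e})")
```

At 100 generations the count is a bit over 10^30, which is the "million trillion trillion" figure.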
Rules of the patriarchy aside, these ancestors are not any more special or important than any others. What is different is that we can see them in the DNA. Sometimes, because they are the only thing about the past we can see, we get overly excited, as in the old joke:
A scientist is seen crawling around a lamp-post looking for something. When asked what he is doing, he says, "looking for my car keys."
"Did you lose them here?" asks the other scientist.
"No, I lost them over there by my car. But the light is here."
I've watched people looking for relatives get very excited to discover somebody with the same haplogroup. They feel a connection. They start to feel that they must be related to the person, or, if the person is a suspected relative, that they will be related along the paternal or maternal line. In fact, all the haplogroup shows is a common ancestor thousands or even tens of thousands of years ago. One common ancestor out of millions of common ancestors. For if you go back that far, all the people then living in our rough geographic region are common ancestors, not just the mothers and fathers of the haplogroups.
People get really excited about the idea of a mitochondrial Eve and a Y-Adam, even though we are also commonly descended from everybody who was alive and offspring-productive in those eras. Everybody was the ancestor of everybody if you go back that far. In fact, the "everybody then was the ancestor of everybody now" period, known as the "universal ancestor" point, would be fairly recent if it were not for the geographic isolation of populations in Australia, the Americas and many islands. Everybody today who has successful children will be the ancestor of everybody in perhaps just 1,000 years if current trends continue. That's because we now have airplanes, and over even 50 generations at a zero-population-growth rate of 2.3 children per couple, you get a million trillion theoretical descendants.
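The descendant figure can be checked the same way. This is a rough sketch that assumes every descendant has exactly 2.3 children and that no two lines of descent ever merge, which is of course not true in practice; the function name is hypothetical.

```python
def theoretical_descendants(children_per_couple: float, generations: int) -> float:
    """Descendant slots after the given number of generations, ignoring overlap."""
    return children_per_couple ** generations

# 50 generations is roughly 1,000 years at 20 years per generation.
total = theoretical_descendants(2.3, 50)
print(f"{total:.2e} theoretical descendants")
```

The result is a little over 10^18, a million trillion, far more than the number of people who will actually be alive, which is why the lines must overlap enormously.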
Some haplogroups are more rare than others, and some are much more common within certain ethnic groups and locations, due to the slowness of pre-20th century human migration. As such they help population geneticists track patterns of migration.
Because there are so few haplogroups, a match is not particularly unlikely, and if you have one of the haplogroups that is particularly common in your ethnic or geographic grouping, a random match is actually fairly likely. A non-match on haplogroups does confirm that the other person is not your relative on that very specific family line, but not that they aren't your relative. A non-match with a sibling or other close relative for whom the known family tree demands a match does indicate a mismatch of genetic and known parentage. And a match on a very close relative (1st cousin and perhaps 2nd cousin) does indicate it is probable (but not certain) the common ancestor will be on the appropriate line.
But on the other hand, consider a haplogroup match with somebody more distant, like a possible 5th cousin. There have been 6 generations from the common ancestor to you. The common ancestors will be one of your 32 pairs of g-g-g-g-grandparents. Of those 32 pairs, one has your maternal line g-g-g-g-grandmother. (Let's look at the maternal haplogroup for now.)
But those 32 pairs all started their own family trees, and the children, 6 generations on, are you and your cousins, all the way up to 5th cousins.
How many such children do they have? That varies a lot, but one thing that doesn't vary too much is that only half will pass down their haplogroup. For the Y haplogroup, only the sons will get it at all. For the mtDNA haplogroup, all will get it but only the girls will pass it on.
So for that 6-generation mother who gave you your haplogroup, only 1/32nd of her descendants will get her haplogroup, including your immediate family. In fact, many of them won't be 5th cousins because they are closer relatives. For example, if the matriarch had just one daughter and 2 sons, then none of your 5th cousins from her got the haplogroup, as everybody who got it is a closer relative. If she had two daughters, the other daughter (your gggg-aunt) passes it on, but with no multiplier in that generation. It varies a lot, but there's an argument that typically only 1/64th of her descendants will get her haplogroup and be 5th (but not 4th or closer) cousins. If we also assume that all the other ancestors had children at a similar rate, we now see that the odds of a 5th cousin or closer having also gotten her haplogroup are about 1 in 2000, and if we remove the closer cousins about 1 in 1700.
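Here is one way to read that arithmetic, as a sketch only. It assumes half of each generation are daughters, that all 32 ancestral couples left equally many descendants, and it takes the 1/64 figure from the paragraph above as given. The names are illustrative, and the sketch does not try to reproduce the 1 in 1700 figure.

```python
from fractions import Fraction

def mt_carrier_fraction(generations_down: int) -> Fraction:
    """Share of an ancestress's descendants, this many generations below her,
    who carry her mtDNA. All her children get it; after that only daughters
    (assumed to be half of each generation) pass it on."""
    return Fraction(1, 2) ** (generations_down - 1)

ANCESTOR_PAIRS = 32  # pairs of g-g-g-g-grandparents, 6 generations back

# Share of her 6th-generation descendants who carry her haplogroup.
carriers_in_your_generation = mt_carrier_fraction(6)

# The rough 1/64 share for carriers who are strictly 5th cousins (assumed above).
strict_fifth_cousin_share = Fraction(1, 64)

# Chance that a random cousin descends from the right couple AND carries the haplogroup.
odds_by_descent = Fraction(1, ANCESTOR_PAIRS) * strict_fifth_cousin_share

print(carriers_in_your_generation)  # 1/32
print(odds_by_descent)              # 1/2048, i.e. about 1 in 2000
```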
With the Y-group, as only half the people in the current generation have it, it's twice as unlikely.
Thus the problem. A given 5th cousin has only about 1 in 1700 odds of sharing your haplogroup through the common ancestor. But there are only around 200 haplogroups, so at this level it's much more likely that they share it just by chance. Worse, if your haplogroup is a common one in your population, a chance match is a great deal more likely still. Of course this doesn't mean it didn't happen by descent, and in fact there will be just a few for whom it did. But it is a bad assumption to make.
Another way to see this is to imagine that surnames are passed down as reliably as Y chromosomes. They aren't, because there are lots of non-sired children due to adoption, infidelity and even sperm donation. In addition, surnames change over time, and many cultures did not even have surnames in the fairly recent past. As you probably realize, only a tiny fraction of your 5th cousins share your surname -- again just one in 2,000. (One in 4,000 if the women of your generation have all changed surname, which of course does not always happen any more.) The haplogroup is something akin to the first two letters of your surname. So if I meet a cousin, and all I learn is that his or her birth surname starts with "Te", it is again much more likely that it's something else, and not Templeton. No surname, not even Smith or Chang, is as frequent as some haplogroups are in their main populations, though.
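To put a rough number on "much more likely by chance", here is a small sketch using Bayes' rule. The 1 in 1700 figure comes from the estimate above. The chance-match rates are assumptions for illustration only: 1 in 200 pretends all haplogroups are equally common, and 1 in 5 stands in for a haplogroup that is common in your population.

```python
def prob_match_is_by_descent(p_descent: float, p_chance: float) -> float:
    """Given that a 5th cousin matches your haplogroup, the probability that
    the match came through the common ancestor and not by coincidence."""
    # A match happens either by descent, or (failing that) by chance.
    p_match = p_descent + (1 - p_descent) * p_chance
    return p_descent / p_match

p_descent = 1 / 1700  # estimate from the text

# Assumption: ~200 equally common haplogroups, so a chance match is 1 in 200.
print(prob_match_is_by_descent(p_descent, 1 / 200))  # about 0.105

# Assumption: a haplogroup carried by 1 in 5 people in your population.
print(prob_match_is_by_descent(p_descent, 1 / 5))    # about 0.003
```

Even under the generous uniform assumption, only about one matching 5th cousin in ten would match by descent.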
When it comes to haplotypes, things are much more specific. Some companies, like Family Tree DNA, will scan 37 or 67 of the STR markers. They claim that a match on 37 markers indicates a recent ancestor "within the period of human record keeping", which is known as the era of genealogy, and that a match on 67 markers indicates a common ancestor "within recent times". They don't specify a time, but it is implied that it's less than 2 centuries. Such matches are more useful for genealogy. It still remains the case that only a small fraction of your distant cousins will share your Y and its haplotype, so the odds of finding the sort of match hoped for are still low. But unlike with the haplogroup, if you find a match in the haplotype it is quite probably real, particularly if the surname matches.
Over time all of this is going to get better, and in particular it's going to get better through the non-haploid DNA, which involves all your ancestors, not just these 2 at the edges of your bushy tree.
In particular, the more people map their DNA, the more individual segments will start getting tracked and even attributed to particular individuals. We should be able to get partial reconstructions of the genomes of people who lived several generations ago by sequencing enough of their descendants. From these reconstructions, combined with more traditional family tree maps, DNA sequencing should, before too long, be able to plunk you down pretty precisely into a fairly complete family tree. Because while most people don't know their great-great grandparents (or even the great-grandparents) all it takes is some descendants who did know them to key in and sequence the appropriate data. Just one link will tie people into a vast web.
How much does this mean? Probably not a lot. I see no special bond among 3rd and later cousins, not even among 2nd cousins. I'm told I had a 3rd cousin who lived on the next street over when I was a kid. I don't remember anything about that. It may have an effect on people who hold prejudices against some ethnic group and then discover they are part of that group. Who knows?
Our connection to the more distant "haplomothers" who were the first members of a haplogroup is even more remote. Most haplogroups formed 10,000 to 30,000 years ago. This is past the "universal ancestor point", which means that you are descended from almost everybody alive in that era, not just from the haplomother. She's just one of millions from whom you are descended. And because you have fewer than 30,000 protein-coding genes, chances are you got none of your genome (except the mitochondria) from her. And your mitochondria are already almost identical to everybody else's on the planet. She is a truly meaningless ancestor, and the only thing that makes us pay attention is that we can identify approximately when she lived, and track broad populations of people using that.