## Statistical Genealogy, or Knowing what is an extraordinary claim

21


It is often said in genealogy (or any field of research, really) that “extraordinary claims require extraordinary evidence”. Fair enough. But how does one know what is an extraordinary claim? We can extrapolate from our own experience and society, but our ancestors often lived in very different times and circumstances. What might seem extraordinary or strange to us might have been quite common to them and vice versa.

I'm a moderator on the Mathematica.SE site, so I’m aware of the social network analysis capabilities in Mathematica, even though I’ve never really used them. They, and a recent post on Stephen Wolfram’s blog, got me thinking about the genealogical equivalent.

The big sites like Ancestry.com, Familysearch.org and FindMyPast.co.uk have all collected literally millions of trees and built up profiles of tens of millions of individuals and families. It seems obvious that one could build up statistical statements about what was most common historically, and then people could compare their current hypotheses about their own ancestors to these statements to help decide if they are making “extraordinary claims”. I’ll give some examples to show what I mean. Bear in mind that the following are all completely made up, but are the sort of thing I had in mind:

15% of men and 16% of women born in 18th century England married more than once; only 2% of either men or women married more than twice. ( ok so my great^4 grandmother probably didn't marry five times and those are different Alice Smiths... )

Or:

94% of people who were not nobility born in Kent between 1650 and 1800 married someone born within twenty miles of them. Less than 1% married someone born more than 50 miles away. ( ok so I can probably rule out that baptism from Lancashire as being for my great^5 grandfather who married and died in Dorset...)

Or:

Only 10% of men born in the 1700s in England who subsequently married did so for the first time after they turned 30, and if their parents married before the age of 25, only 3% married after the age of 30. ( hmm, so my great^5 uncle is probably that guy, not that guy who married eight years later. So his wife is most likely to be this woman not that one...)

So my question is: has anyone done this, even over a restricted field such as a particular century or country? Surely there is a statistically-minded Mormon over at FamilySearch or wherever who has thought of doing this? While the exceptional cases clearly can still have happened, it would surely help people formulate their hypotheses if they had an idea how likely or unlikely certain outcomes might have been.

EDIT

Lots of good responses in the answers and comments. I agree that the data are very dirty. But I'm an economist and thus used to dealing with messy, poorly measured data. The question is whether the errors in the data are enough to materially skew the kinds of statistical statements I have in mind. I can imagine that there would be some skew away from correct conclusions that happened to be unusual (the thrice-married woman, the 17th century couple born 100 miles apart etc). But for a lot of things, the errors in people’s trees would cancel each other out. I think this is the difference between a statistical approach and an historical approach.
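To make the distinction between errors that cancel and errors that skew concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the ages at first marriage, the size of the errors, and the assumption that errors are either symmetric or all in one direction. It is not based on any real tree data.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Invented "true" ages at first marriage for 10,000 people.
true_ages = [rng.gauss(26, 4) for _ in range(10_000)]

# Symmetric errors: a recorded age is as likely to be too high as too low.
symmetric = [age + rng.uniform(-5, 5) for age in true_ages]

# One-sided errors: every mistake pushes the recorded age upward.
one_sided = [age + rng.uniform(0, 5) for age in true_ages]

# How far each noisy data set moves the average away from the truth.
bias_symmetric = mean(symmetric) - mean(true_ages)
bias_one_sided = mean(one_sided) - mean(true_ages)

print(f"shift in mean with symmetric errors: {bias_symmetric:+.2f} years")
print(f"shift in mean with one-sided errors: {bias_one_sided:+.2f} years")
```

Under these assumptions the symmetric errors leave the average close to the truth, while the one-sided errors shift it by about the average size of the error. Whether the errors in real compiled trees look more like the first case or the second is exactly the open question.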

I'm curious how you'd see the results being skewed by the huge amounts of duplication in the "big trees", even setting aside the other quality issues. – None – 2013-04-27T10:12:10.223

@ColeValleyGirl - think you'd have to eliminate duplicate personas from the data set before doing something like this. But if most of the questions are about individuals, couples and families, the database you'd use for statistical work would be based on a relational database that could cut the data in the required way, not trees. – Verbeia – 2013-04-27T21:45:54.130

"you'd have to eliminate duplicate personas from the data set..." That would be a really neat trick. Kind of the holy grail of big tree genealogy. – None – 2013-04-28T09:56:37.953

I have been looking for something similar for a number of years. I would like to boil it down into a single confidence number (e.g., a 95% probability that the person you are looking at is the person you think it is) and have such a number for every person in the tree. I started looking at basic probability but didn't get very far. For example, if there are n John Smiths in the 1881 census, then the chance that any one of them is the right person is 1/n. Given that the census is only x% complete, each one's chance drops to (x/n)%; you get the idea. Very interested in knowing what you discover. – Magic Bullet Dave – 2014-12-21T15:00:56.503
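The arithmetic in this comment can be sketched in a few lines of Python. The function name `match_probability` and the example numbers are hypothetical, and the sketch assumes that every candidate of the same name is equally likely to be the right person, which real research would refine with ages, places and occupations.

```python
def match_probability(n_candidates: int, coverage: float = 1.0) -> float:
    """Chance that one specific candidate is the person sought.

    Assumes all candidates are equally likely, and that `coverage` is the
    fraction (0 to 1) of the population captured by the record set.
    """
    if n_candidates < 1:
        raise ValueError("need at least one candidate")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return coverage / n_candidates


# Made-up example: 40 John Smiths, census assumed to be 95% complete.
p = match_probability(40, coverage=0.95)
print(f"chance for any one candidate: {p:.2%}")
```

A per-person confidence number of the kind described would need to combine many such pieces of evidence; this only covers the first step.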

You could build up statistics about what information got copied the most, but I doubt you could learn much more from analyzing compiled trees. There might be cases where the outlier tree which disagrees with everyone else might be the most accurate, because the owner had sought out more sources and done a better analysis of the available information. – Jan Murphy – 2014-12-21T19:46:47.017

10

An example of a study of probable outcomes is: Probert, Rebecca. Marriage Law for Genealogists: The Definitive Guide. Kenilworth: Takeaway (Publishing), 2012. It is based in part on detailed analysis of primary source data about marriage for "entire communities and cohorts" in England, and addresses, for example:

• How many couples lived together without marrying, based on analysis of illegitimacy rates in the 18th and 19th centuries (which peaked at 7% in the 1860s).
• What proportion of couples from a single county married in the home parish of one of the couple (41% between 1700 and 1751).

My personal opinion is that the data in trees on FamilySearch, Ancestry etc. is too poor to be a valid basis for statistical analysis.

This is the closest to a direct answer to my question, so thank you. But see my edit to my question: statistically, data with measurement error can still be valid for analysis, but the confidence in conclusions will be reduced. The question is whether the errors in the data bias the conclusions. I am not so sure this is necessarily true. – Verbeia – 2013-04-27T03:02:37.053

7

Without wishing to pre-empt a (more expert) contribution from a statistically-minded Mormon (sic), I suggest that your idea (while superficially very attractive) demands that four things come together to make it work.

1. Analytical tools capable of crunching the relevant free-form data
2. Sufficiently extensive big data stores "open" to exploitation
3. Genealogical knowledge of the types of hypothesis that need a reality check
4. Confidence in the "cleanliness" of the data.

As you suggest, Wolfram has shown us that 1 can be done. All the advertisements claim that 2 is a reality. Forums such as SE could provide 3.

But, IMHO, we stumble at 4. How can I have any confidence in your claim that great grandma did not marry 7 times when your "evidence" is based on a dataset in which grandma is duplicated at least a dozen times? We know from the opening up of FS Family Tree that this situation is the norm rather than an exception. Lest anyone claim that Ancestry does better, I have one word for you - Mocavo.

Last century, big-iron number crunchers encapsulated the problem in the acronym GIGO. It will be difficult to get anyone to put in the effort to carry out the analysis or to take the results seriously when your raw material (taken en masse) is so clearly garbage.

That was my first thought too. If you can find a clean data set then the rest is almost trivial (in comparison, at least). – None – 2013-04-25T13:16:35.813

And it was mine as well, but I wanted to put the idea out there. Also, I think it's apparent that Ancestry.com already has some kind of algorithm for ranking user trees by quality, e.g. by how many sources are used. It would be straightforward for them to exclude trees that are just copied from other people's trees, by the sourcing data. – Verbeia – 2013-04-25T20:58:02.373

My view is that the UK census data is as clean as anything gets and could therefore be used to answer movement-type questions for the 19th century. How relevant those numbers would be for earlier times is a moot point. But surely an analysis of that data would assist with the question of "How far afield is it reasonable to look for a birth/baptism in parish registers?" (Nothing annoys me more than those who criticise newbies for only looking in the same parish for someone's origins but never say how far afield people should look! And yes - clearly the answer is "It depends.." But on what?) – AdrianB38 – 2013-04-25T21:23:20.077

"Lest anyone claim that Ancestry does better, I have one word for you - Mocavo." -- Perhaps you could provide a few more words on the subject? What about Mocavo? – Keith Thompson – 2013-04-26T00:26:01.577

While this is not the place to "discuss" my views, the implication is that Mocavo is an Ancestry product riddled with duplicate (and frequently incompatible) personae. – Fortiter – 2013-04-26T01:43:02.170

@Fortiter: A reminder, if you had tagged your reply with my username, I would have been notified when you posted it. I saw your reply only because I came back and checked. Thanks for the info. – Keith Thompson – 2013-04-26T22:23:33.450

I wonder if some of the collaborative family trees (WeRelate, WikiTree, etc.) might eventually have enough quality data for something like this. Genealogy is too messy to be done without human intervention, but it seems like alerts of potential bad decisions, or suggestions for places to look could be helped by this sort of data. – Jeremy – 2013-04-27T00:37:12.267

4

The kind of generalizations you've used as examples have been formulated (although usually without statistical rigor) by many researchers working with original sources instead of conclusion trees. Naming, occupation and residence patterns and surname distribution, for example, may be readily determined if the source includes enough detail. One-place studies are invaluable for looking at this kind of data across multiple source types. There have been articles published in sociological journals that do include the math, but the topics tend to be less helpful to the average genealogist.

Can you point us to examples of one-place studies and the formulated generalizations you mentioned? – None – 2013-04-25T14:20:57.877

One-place studies: http://www.one-place-studies.org/list-contents.html, http://www.ortsfamilienbuecher.de/index.php, http://wiki-de.genealogy.net/Kategorie:Ortsfamilienbuch. Finding a link to published generalizations will take a while; an unpublished list of my own for my mother's side is somewhere in my notes. – bgwiehle – 2013-04-25T14:30:45.680

Stat Ranger by Tim Forsythe is really neat: http://timforsythe.com/tools/ranger It has a statistics page that summarizes the results from hundreds of GEDCOMs here: http://timforsythe.com/tools/stats – lkessler – 2013-05-01T02:13:18.993

4

I googled "statistical genealogy", expecting no hits, and found this question.

LitvakSIG, the Lithuanian-Jewish research group (http://www.litvaksig.org), publishes data to paid-up contributors in the form of Excel files, which makes this sort of study straightforward for a particular locality. One can even do some sorts of longitudinal time studies for those locales with enough data.

So, in an amateur way, I have been posting over the last 6 years about Jewish marriages, deaths and population growth in one district in north-east Lithuania. Do please drop by at http://zarasai.blogspot.co.uk/search/label/Statistical%20genealogy.

The posts include:

• a quick look at population growth in one shtetl over the period 1784 to 1887 which shows a nice log curve with population doubling about every 16 years.

• age at death statistics for the period 1922 to 1939 for two districts of interwar Lithuania, compared with Swiss mortality for 1928. The eyeball correlation is good, apart from an odd dip in mortality at ages 71 to 75. Swiss data for earlier periods might therefore offer insights into Litvak deaths at that time.

• analysis of marriage data for 1877 to 1915 which shows a steady increase in age at marriage of about 1 year every decade to about 1900 and then a sudden drop - about when mass emigration accelerated. Some analysis of the geographical dispersion of spouses with about 70% from nearby locations within 50 km, but some from Georgia and Dagestan - about 2,800 km away by road.
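As a rough illustration of what "doubling about every 16 years" implies, here is a small Python sketch of the doubling-time arithmetic. The function names are hypothetical and the population figures in the example are made up; only the 16-year doubling time and the 1784 to 1887 span come from the summary above.

```python
import math


def growth_factor(years: float, doubling_years: float) -> float:
    """How many times larger a population becomes after `years`,
    assuming steady exponential growth with the given doubling time."""
    return 2.0 ** (years / doubling_years)


def doubling_time(start_count: float, end_count: float, years: float) -> float:
    """Doubling time implied by two population counts `years` apart,
    again assuming steady exponential growth."""
    return years * math.log(2) / math.log(end_count / start_count)


# Span covered by the post: 1784 to 1887, doubling roughly every 16 years.
print(f"implied growth factor: {growth_factor(1887 - 1784, 16):.1f}")

# Made-up counts: 100 people growing to 400 over 32 years.
print(f"implied doubling time: {doubling_time(100, 400, 32):.1f} years")
```

Taken literally, doubling every 16 years over those 103 years would mean roughly an 87-fold increase, which makes a handy sanity check against the actual counts.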

20/21 December 2014

Hello, Paul -- welcome to G&FH.SE! Could you please edit your answer to include an abstract of the post which you mentioned, including the date that it was posted? We would like to have enough information to make your answer useful in case the link to your blog is broken. – Jan Murphy – 2014-12-20T23:41:40.010

1

What a fascinating question and discussion. I don't believe that dirty data is an issue here. Any applied research has to deal with effects of data quality and genealogical data doesn't appear to be obviously special in its characteristics.

One of my frustrations with the current genealogical tool sets is that they don't make it especially easy to, for example, "chart a time series of % illegitimate births, bucketed every ten years". When I look at other people's public trees on Ancestry, it appears that there are a good many trees of size 10,000 but far fewer trees that are 40k in size.
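For what it's worth, the ten-year time series described above is only a few lines of Python once the births are out of the tree software. This is a sketch under assumptions: the input format, a list of (birth year, illegitimate?) pairs, is hypothetical, the sample records are invented, and getting such a list out of a real GEDCOM file is the hard part not shown here.

```python
from collections import defaultdict


def illegitimacy_by_decade(births):
    """Percentage of illegitimate births per decade.

    `births` is an iterable of (year, is_illegitimate) pairs.
    Returns {decade_start_year: percent_illegitimate}, ordered by decade.
    """
    totals = defaultdict(int)
    illegitimate = defaultdict(int)
    for year, is_illegitimate in births:
        decade = (year // 10) * 10  # e.g. 1843 -> 1840
        totals[decade] += 1
        if is_illegitimate:
            illegitimate[decade] += 1
    return {d: 100.0 * illegitimate[d] / totals[d] for d in sorted(totals)}


# Invented sample records, purely to show the shape of the input.
sample = [
    (1841, False), (1843, True), (1848, False), (1849, False),
    (1852, True), (1857, False),
]
rates = illegitimacy_by_decade(sample)
for decade, percent in rates.items():
    print(f"{decade}s: {percent:.0f}% illegitimate")
```

The resulting dictionary can be handed to any charting tool to draw the series.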

Have you looked at the British names and world surname GIS projects? These are a useful resource that offer statistical insight into specific surnames. When I look at RootsMagic or Mac Family Tree I can produce reports that list improbable facts. It would be great to see these listed in probability order.

I'm wondering what the smallest, lowest scope example of the functionality you describe would be?


Hi Peter – feel free to ask your last point as a new question, actually I'd encourage it since it's been 4 years since this was asked. Follow up questions are encouraged if they are not covered by the first question. You can provide a link to this question and quote any relevant details if you like. – Harry Vervet – 2017-04-27T14:14:37.273

Welcome to G&FH SE! As a new user be sure to take the [Tour] to learn about our focussed Q&A format which is quite different from bulletin boards, discussion forums and other Q&A sites you may be used to. – PolyGeo – 2017-04-28T07:40:21.073