Using git for genealogy collaboration



Are there any existing projects or proposed standards for using git (a source version control system) to collaborate on building genealogical source records and/or conclusions?


Git (best known on hosted sites like github and bitbucket, but available as a command-line or graphical application on most platforms) is a way to store (mostly) text and programming source files, so that changes to individual lines within the files can be made by different people in different places, and merged together with a full audit trail of who changed what (called "blame").

It's also possible to split data off into a branch, make proposed changes to it, and then submit it back to those responsible for the main branch as suggested improvements, which can be accepted or rejected, and merged back into the main branch. So it's not a free-for-all like a wiki: there are ways to keep control over the data yet still accept improvements.

It seems it would make a good fit for genealogy source data, which is often text-based (perhaps with images attached). It would make it easy for transcriptions to be added to and corrected by many people and, more importantly, to see who changed what.

It could also be a way to store assertions, and provenance trails - show how a conclusion was reached, in a format that allows corrections and new data.
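As a rough illustration of what a stored assertion with a provenance trail might look like, here is a minimal Python sketch. The field names (`subject`, `claim`, `asserted_by`, `sources`) and the idea of writing one record as sorted, indented JSON are assumptions made for the example, not part of any existing standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Assertion:
    """One conclusion plus the evidence it rests on (hypothetical schema)."""
    subject: str       # person or event the claim is about
    claim: str         # the conclusion itself
    asserted_by: str   # researcher making the claim
    sources: list = field(default_factory=list)  # citations backing it

    def to_text(self) -> str:
        # Sorted keys and one field per line keep diffs small and stable.
        return json.dumps(asdict(self), indent=2, sort_keys=True) + "\n"


birth = Assertion(
    subject="person-D",
    claim="born in XYZ",
    asserted_by="user-X",
    sources=["parish register (transcription)", "census entry"],
)
text = birth.to_text()
print(text)
```

Because each field and each source sits on its own line, a later correction or an added source would show up in version control as a small, attributable change.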

Existing projects

There are projects around to store gedcom data in git (example: git-ged), which is one way of handling legacy data, but I'm thinking it would be more useful for newer standards that are source-based, rather than just for conclusion reporting formats like Gedcom.

Many of the new standards like Gedcom-X are XML-based (perhaps with an alternate json representation). Neither XML nor (to a lesser extent) json maps well to a line-based version control system (because it can be hard to retain the structure when individual lines are changed by different people at the same time). It's also difficult for a human to edit those formats.
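To make the line-based problem concrete, here is a small Python sketch (standard library only; the record and its field names are invented). Two editors each correct a different field of the same record. When the record is stored on one line, both edits touch the same line; when it is stored one field per line, they touch different lines.

```python
import difflib
import json

# Invented example record; the field names are illustrative only.
base = {"id": "p1", "name": "Ann Smith", "born": "1820", "place": "Leeds"}
editor_a = dict(base, born="1821")               # A corrects the year
editor_b = dict(base, place="Leeds, Yorkshire")  # B expands the place


def touched(old_text, new_text):
    """Indices of the lines in old_text that an edit modified or removed."""
    old, new = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    indices = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            indices.update(range(i1, i2))
    return indices


def compact(record):
    return json.dumps(record, sort_keys=True)


def pretty(record):
    return json.dumps(record, sort_keys=True, indent=2)


# Lines touched by both editors: a line-based merge would flag these.
overlap_compact = touched(compact(base), compact(editor_a)) & touched(
    compact(base), compact(editor_b)
)
overlap_pretty = touched(pretty(base), pretty(editor_a)) & touched(
    pretty(base), pretty(editor_b)
)
print("one line per record: ", overlap_compact)
print("one line per field:  ", overlap_pretty)
```

This only shows overlap on the same line. A real merge tool may also complain about changes on neighbouring lines, so layout alone does not remove every conflict.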

Although gedcom is line-based, it too has a structure that is easy to mess up with multiple editors, and it has other limitations (lots of cross-references that are easy to break).

Extending Git

Git (and github) have been extended to display data in different ways, for example to map the data if it is in geojson (a geographic point/line format), or to show tables if it is in csv. So there could be certain requirements (restrictions) of a text-based genealogy format that would make it possible to generate reports from it. It doesn't have to be directly usable in a program, just something that can be edited by multiple people in a fairly easy way.


For source data (lists of events, simple transcriptions) it's maybe simplest just to stick to csv (comma-separated values) files, which can be easily imported (and exported) by most software.

It would be good to have a csv description language with a genealogy vocabulary that describes the data. The Simple Data Format is one such way to describe csv files. Has anybody been working on marking up genealogy source files with a system like this?
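As a hedged sketch of what a csv file plus a description of its columns could look like in practice, the Python below pairs a small csv of events with a separate column descriptor and checks the data against it. The column names, the `type` labels and the descriptor layout are all invented for illustration. They follow the general idea of a schema that sits beside the data and are not taken from the Simple Data Format specification.

```python
import csv
import io

# Hypothetical descriptor: one entry per column, in a made-up vocabulary.
SCHEMA = [
    {"name": "event_type", "type": "string", "required": True},
    {"name": "date", "type": "date-text", "required": True},
    {"name": "place", "type": "string", "required": False},
    {"name": "person_name", "type": "string", "required": True},
]

# Invented sample data; in a repository this would be a .csv file.
EVENTS_CSV = """event_type,date,place,person_name
baptism,1820-03-12,Leeds,Ann Smith
burial,1885-11-02,,Ann Smith
marriage,,York,John Smith
"""


def validate(csv_text, schema):
    """Return a list of (line_number, message) problems found in the csv."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [col["name"] for col in schema]
    if reader.fieldnames != expected:
        return [(1, f"header {reader.fieldnames} does not match {expected}")]
    problems = []
    for row in reader:
        for col in schema:
            value = row[col["name"]] or ""  # short rows yield None
            if col["required"] and not value.strip():
                # reader.line_num is the 1-based line that was just read.
                problems.append((reader.line_num, f"missing {col['name']}"))
    return problems


problems = validate(EVENTS_CSV, SCHEMA)
print(problems)
```

A check like this could run before changes are accepted, so that a contributed transcription with a missing required column is caught early.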

Open Data

Much of this assumes that the source data (events) are in a format which is shareable (and so editable by many). Unfortunately very little data currently has suitable licenses, but that is changing.

Rob Hoare

Posted 2013-10-18T22:59:26.040

Reputation: 6 076

I tried to maintain a viewer app with a Git-controlled YAML database continuously exported from Gramps. Ditched the idea today because it just didn't work (either file too big or relations between entities are invisible). Moving that app to MongoDB; will continuously export the data to JSON for version control. Basically you can do the same with Gramps (XML or JSON). – Andy Mikhaylenko – 2018-04-08T01:58:40.727

Sam, that Gramps Connect project does look like it could eventually be interesting for users of that software (as it's based very much around their Gedcom-derived file format). I wasn't able to find anything which states what it is supposed to do though, and it appears to have been abandoned a year ago (which is understandable, it's a hard problem, especially when working around legacy file structures). Something that avoids gedcom to work on data at a more event-based level would fit better with the source data (not just conclusions) that is now available. – Rob Hoare – 2013-10-19T00:26:10.987

Sam, I'm trying to avoid anything specific to any particular language (like the Dulwich Python interface to git). The first step is at a lower level: what representations of genealogical data can fit a version control system, are there any standards or new ideas, does anything exist (using more flexible formats than gedcom)? In other words, how to store data in a format that can be widely used. With a standards-based format (even if that standard is just csv) it shouldn't matter what language is used to access it later, nor indeed which text-based source control system if git ever goes away. – Rob Hoare – 2013-10-19T00:30:01.367



I realize this is a bit of a ramble, and probably not the answer you are looking for. There is something of real value in the question you are asking, and a solution (by extension) would be equally valuable. I would start by splitting up the answer into several components:

  1. Version control & collaboration
  2. Data format
  3. Visualization
  4. Integration

Version Control & Collaboration

The question here is whether git is a good solution to collaboration, and I would say yes, certainly. Assuming you can solve the other components to this problem, git provides several advantages that I would see applicable to genealogical research:

  • The ability to share data between researchers.
  • The ability to 'branch' a body of work, to experiment with new data and see how it applies to one or more records before committing the change.
  • (Specifically for GitHub/Bitbucket) The ability to submit pull requests between efforts. This gives two (or more) genealogists the ability to reason about changes to a series of data before settling on the 'master' result. This is probably the single greatest win git provides above other version control systems.

To that last point, I would think there is an advantage to version controlling a genealogical research effort with any kind of tool you can get your hands on. At this time I would consider git one of the best options available.

Data Format

Data formats are an implementation detail. It is what you do with that data that is more important than how the data is stored. In order for systems like git to manage the data well, a text-based format would be necessary. Formats like GEDCOM, XML and JSON have their own (dis)advantages, so without getting into a religious argument I would simply offer this advice: pick one and move forward. If it is done right, any solution should be able to migrate from one underlying format to another without the user being the wiser.


Visualization

This horse has been beaten dead, and is not worth revisiting at this point. There are many ways to parse through genealogical data, browse it, and modify it. A solid solution here is essential to make sure non-technophiles are able to use whatever system is being offered.


Integration

This is the million-dollar issue. Can you make a front-end that works seamlessly with some text-based file format, and have it manage everything with a version control system on the back side? If you think of a researcher's data as code, what you are talking about here is a genealogical IDE, and something that would be a huge tool to have in one's toolbox. Open source IDEs exist (Eclipse comes to mind) that I would think can be extended to manipulate and visualize genealogical data. The git integration is already there, so it is a question of tying it all together.


Posted 2013-10-18T22:59:26.040

Reputation: 4 906

I agree with @RobHoare's comments. Not caring about the format is like not caring whether people represent their code as C language, or preprocessor output, or compiled executables. C language works well with git. Compiled executables? Not so much. Remember, humans have to work with the diffs between commits, and that can get really messy even with a good front end. It's even messy with git used on Jupyter notebooks (a json intermediate format). And you don't want a change of tooling to reformat lots of data creating a huge diff, so you want canonicalization layers and standards like pep-8. – nealmcb – 2020-07-06T15:00:54.997

I really like your idea of thinking of the front-end as an IDE. An open-source web IDE with good ways to extend it, like Codiad or Cloud9 (among others), could be a way to start. I do think though that the data structures (not the data format) are a crucial element, not a detail. Some ideas to think about here. – Rob Hoare – 2013-11-10T04:38:40.247


You already mentioned git-ged, which is really the only thing at github that comes close to what it seems you want.

Dovy Paukstys added a git entry for SourceTemplates two years ago, but you won't find anything there.

Git is for real geeks and/or programmers, and most genealogists aren't up to snuff to use it.

The tool for most genealogists that is most like what you seem to be asking for (with audit trails, merging, branches, etc.) would be a wiki. They vary in the number of features they have.

WikiTree is the most popular, but there's a whole bunch of others such as these listed at GenSoftReviews.

Some of the web-based programs, such as PhpGedView (which is open source), create full audit trails and allow collaboration, and programs like that might fit many of your requirements.


Posted 2013-10-18T22:59:26.040

Reputation: 16 148


Note that the last release of PhpGedView was 26 December 2009. Some of the developers forked the project, and the fork is under very active development.

– Randy Orrison – 2013-10-22T11:41:02.020


Genealogists will never use git directly, but a frontend is eminently possible.

Neither XML nor (to a lesser extent) json maps well to a line-based version control system (because it can be hard to retain the structure when individual lines are changed by different people at the same time).

The only real problem with using git for version control of hierarchical text-based data formats is merge conflicts due to changes in adjacent lines. These can be easily resolved by parsing the full surrounding data entity of those lines and determining whether they actually differ in the two branches.
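The entity-level check described above could be sketched roughly as follows. This is not how git itself merges. It is a hypothetical helper, in Python, that a frontend might run after parsing the same record from the common ancestor and from both branches. Representing a record as a flat dict of fields is an assumption made for the example.

```python
def merge_record(base, ours, theirs):
    """Three-way merge of one parsed record, field by field.

    Returns (merged, conflicts), where conflicts lists the fields that
    both branches changed to different values.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:      # both sides agree (or neither changed it)
            value = o
        elif o == b:    # only their side changed this field
            value = t
        elif t == b:    # only our side changed this field
            value = o
        else:           # both changed it differently: a real conflict
            conflicts.append(key)
            value = o   # keep ours provisionally; the user must resolve
        if value is not None:  # None means the field is absent or deleted
            merged[key] = value
    return merged, conflicts


base = {"name": "Ann Smith", "born": "1820", "place": "Leeds"}
ours = {"name": "Ann Smith", "born": "1821", "place": "Leeds"}
theirs = {"name": "Ann Smith", "born": "1820", "place": "Leeds, Yorkshire"}

merged, conflicts = merge_record(base, ours, theirs)
print(merged, conflicts)
```

Here the two edits sit on neighbouring fields, which a line-based merge may flag, yet the field-level comparison finds nothing that actually disagrees.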

In all other cases where two commits change the same line, you want there to be a merge conflict so the user has to change their data (or rebase/merge upstream, but that would be a ton of work) in order to make it merge cleanly.

To enable that property, you need the text format to keep records sorted and indented identically so there are no superfluous changes with almost guaranteed merge conflicts.
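A minimal sketch of such a canonical export, in Python with the standard library's ElementTree: records are sorted by ID and re-indented uniformly before being written. The `people`/`person` element names and the `id` attribute are made up for the example and are not the Gramps XML schema.

```python
import xml.etree.ElementTree as ET


def canonicalize(xml_text):
    """Re-serialize with records sorted by id and uniform indentation."""
    root = ET.fromstring(xml_text)
    # Sort top-level records by their (hypothetical) "id" attribute.
    root[:] = sorted(root, key=lambda rec: rec.get("id", ""))
    ET.indent(root, space="  ")  # requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode") + "\n"


# The same two records, exported in a different order and layout.
export_a = (
    '<people><person id="b"><name>Bob</name></person>'
    '<person id="a"><name>Ann</name></person></people>'
)
export_b = """<people>
    <person id="a">
        <name>Ann</name>
    </person>
  <person id="b"><name>Bob</name></person>
</people>"""

canonical_a = canonicalize(export_a)
canonical_b = canonicalize(export_b)
print(canonical_a)
print(canonical_a == canonical_b)
```

If every export goes through a step like this, two commits differ only where the data differs, not where a tool happened to reorder or re-indent records.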

For this to work with a format where records are referenced by ID (probably all of them), you need that format to use UUIDs for internal IDs. (Otherwise there will be conflicts due to sequential ID generation.)
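The ID clash is easy to demonstrate. In this Python sketch, `next_sequential_id` stands in for a typical "highest ID plus one" scheme. The `I0001` style of identifier is only an illustration and not a claim about any particular program's generator.

```python
import uuid


def next_sequential_id(existing_ids):
    """Sequential scheme: one more than the highest ID currently in use."""
    highest = max(int(ident[1:]) for ident in existing_ids)
    return f"I{highest + 1:04d}"


shared = {"I0001", "I0002"}  # the state both researchers start from

# Each researcher adds one new person on their own branch.
branch_a_new = next_sequential_id(shared)
branch_b_new = next_sequential_id(shared)
print(branch_a_new, branch_b_new)  # two different people, one ID

# Random UUIDs are generated independently and do not depend on the
# existing records, so the two branches get distinct IDs.
uuid_a, uuid_b = str(uuid.uuid4()), str(uuid.uuid4())
print(uuid_a != uuid_b)
```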

Gramps XML meets all the requirements for this.

Gramps could be extended to mirror each database as a git repository containing the uncompressed XML export of the database. You will only need to export or import when doing checkout, branch, pull, push, or merge operations; for the rest of the time Gramps will still operate on its internal DB.

Genealogists are never going to host their own git servers, so you would need to make a centralized web app where database administrators can review and merge changes. For any commit, just load the entirety of each of the two versions of the file as an object in memory, diff the two versions, and display the difference in a human-readable format.
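The "load both versions, diff, display" step might look something like this in Python. It assumes each version has already been parsed into a dict of the form `{record_id: {field: value}}`; that structure and the wording of the messages are assumptions for the sketch.

```python
def describe_changes(old, new):
    """Compare two {record_id: {field: value}} snapshots in plain language."""
    lines = []
    for rec_id in sorted(set(old) | set(new)):
        if rec_id not in old:
            lines.append(f"{rec_id}: record added")
        elif rec_id not in new:
            lines.append(f"{rec_id}: record removed")
        else:
            before, after = old[rec_id], new[rec_id]
            for fld in sorted(set(before) | set(after)):
                a, b = before.get(fld), after.get(fld)
                if a != b:
                    lines.append(
                        f"{rec_id}: {fld} changed from {a!r} to {b!r}"
                    )
    return lines


old = {"p1": {"name": "Ann Smith", "born": "1820"}}
new = {
    "p1": {"name": "Ann Smith", "born": "1821"},
    "p2": {"name": "John Smith"},
}
changes = describe_changes(old, new)
for line in changes:
    print(line)
```

A reviewer would then see a list of record-level changes rather than a raw text diff of the underlying file.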

Genealogists can then update their local copies by clicking a "pull" button. You'd want to keep a separate branch solely for tracking the remote so you never have merge conflicts with that, and then when the genealogist pulls new changes, branches with merge conflicts against the new changes could be displayed.

(Use pygit2 or the like.)


Posted 2013-10-18T22:59:26.040

Reputation: 111

No one I know hosts their own git server; they all use github or another low-priced commercial service. So I would expect that genealogists would do the same thing. – Pat Farrell – 2016-02-06T22:40:25.543


I've been thinking along the lines of using git for genealogy data for the genealogy program that I'm considering writing. Obviously, using git would require a front-end that is user-friendly and specifically designed for genealogy usage.

The attraction is to be able to say "this fact about ancestor D was entered by user X." But there is a bit of a split in philosophy when what we want to record is "based on these three questionable sources, user X has concluded that ancestor D was born in XYZ."

In the git world, and in all other source code control systems that I've used over the decades, the guiding principle is that there is one correct way for the lines of code to be. Code has a bug, and this patch fixes it. Or the code was missing this feature, and this patch adds the feature.

In genealogy, we often want to record stuff like "while based on these three questionable sources, user X has concluded that ancestor D was born in XYZ, user Y believes that these two additional sources make a strong case that ancestor D was really called E, and he was born in UVW."

I am not sure that using git (or any other source-code oriented tool) will really help us do what we want.

Pat Farrell

Posted 2013-10-18T22:59:26.040

Reputation: 356

You can use git blame on a text-based format to show which user recorded which facts. – mwhite – 2016-02-08T14:20:12.190

Personally, I think the appeal of representing evidence in a structured way is overrated. Just use the basic genealogy format for representing conclusions and then let users discuss evidence in notes and on pull requests. – mwhite – 2016-02-08T14:22:30.360

Gedcom-X is a genealogical-source-based standard which addresses your concern, allowing people to present the sources of their ancestry claims - check it out. It would be fine to use git on a suitable source-based standard, as the OP noted. – nealmcb – 2020-07-06T15:19:37.593