## How to deal with version control of large amounts of (binary) data

30

I am a PhD student of Geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know svn and git fairly well and have come to value a project history, combined with the ability to easily work together and have protection against disk corruption. I also find git extremely helpful for having consistent backups, but I know that git cannot handle large amounts of binary data efficiently.

In my master's studies I worked on data sets of similar size (also images) and had a lot of problems keeping track of different versions on different servers/devices. Diffing 100GB over the network really isn't fun, and it cost me a lot of time and effort.

I know that others in science seem to have similar problems, yet I couldn't find a good solution.

I want to use the storage facilities of my institute, so I need something that can use a "dumb" server. I also would like to have an additional backup on a portable hard disk, because I would like to avoid transferring hundreds of GB over the network wherever possible. So, I need a tool that can handle more than one remote location.

Lastly, I really need something that other researchers can use, so it does not need to be super simple, but should be learnable in a few hours.

I have evaluated a lot of different solutions, but none seem to fit the bill:

• svn is somewhat inefficient and needs a smart server
• hg bigfile/largefile can only use one remote
• git bigfile/media can also use only one remote, but is also not very efficient
• attic doesn't seem to have a log, or diffing capabilities
• bup looks really good, but needs a "smart" server to work

I've tried git-annex, which does everything I need it to do (and much more), but it is very difficult to use and not well documented. I've used it for several days and couldn't get my head around it, so I doubt any other coworker would be interested.

How do researchers deal with large datasets, and what are other research groups using?

To be clear, I am primarily interested in how other researchers deal with this situation, not just this specific dataset. It seems to me that almost everyone should have this problem, yet I don't know anyone who has solved it. Should I just keep a backup of the original data and forget all this version control stuff? Is that what everyone else is doing?

If you tell me three things I might have an answer! 1. Does your medium-sized data get bigger? If so, how often? 2. Do you use programming languages/frameworks to do pattern matching and data analysis? 3. Does anyone else use this data? If so, do they change it? – None – 2015-02-13T10:40:38.030

I voted to "Leave Open" when reviewing the Close Votes queue because I don't think this question is about one specific situation. However, can anybody who is familiar with Software Recommendations SE tell us whether this question would fit on that site? – None – 2015-02-13T10:52:42.327

1@scaaahu I don't think this is necessarily a software question; an acceptable answer could also describe a workflow or combination of tools and systems. (Anyways, being on topic somewhere else shouldn't play into the decision to close a question here.) – None – 2015-02-13T11:01:58.770

2Just to protect against data corruption with image data, I periodically run a script that re-computes a checksum file with all files and their md5 checksums. The checksum file is then kept in git. Now I can immediately see with git diff if any of the checksums have changed. And I can also see which files have been removed & added. And if there are e.g. any signs of data corruption, then I can use the regular backups to restore old versions. Not perfect but better than nothing. – None – 2015-02-13T11:30:36.463

@JukkaSuomela I think you should post that as an answer, not a comment. – None – 2015-02-13T11:48:49.980

@Johann Why not just files, without version control (but with backups)? – Piotr Migdal – 2015-02-13T12:22:54.293

@PiotrMigdal: Are you seriously asking why people should use version control, instead of just having a bunch of files with backups?-) – None – 2015-02-13T13:01:50.247

1@JukkaSuomela I think it's a reasonable question when you've got very large datasets, if those datasets change frequently... in those cases, backup often is what's used as version control. – None – 2015-02-13T13:29:18.427

1I'm voting to close this question as off-topic because it deals with data/databases rather than something specific to academia. The question is great, and (IMHO) should be moved to DataScience.SE or (perhaps) Databases.SE. – Piotr Migdal – 2015-02-13T14:02:35.853

@PiotrMigdal I don't know if this is off-topic, but this question does also not fit well with Databases.SE or DataScience.SE. I would like to know what other researchers/Institute do in practice to deal with this kind of problem - I've updated the question accordingly. – None – 2015-02-15T19:22:05.807

@DaveRose 1. Yes, I will hopefully add more experimental data and processed images, but not very often (maybe a few iterations); 2. Yes, that is part of my thesis; 3. Yes, others will hopefully use and change the data. – None – 2015-02-15T19:32:18.047

@JukkaSuomela That actually sounds pretty good (at least much better than anything I've found so far). – None – 2015-02-15T19:33:45.380

@PiotrMigdal Going without version control kind of pains me for the reasons I've stated in the question (especially: "did my data change without me noticing?" and then 2 days of diffing by hand). So I am looking for something smarter. – None – 2015-02-15T19:35:56.057

@Johann 1. How you store or version control it is in the domain of data science (regardless of whether you use it in academia, industry, or for a hobby project). I really want to ensure the best answers, and it is good to go where there are many experts in data. 2. My point was only that it might not be a job for git. (And, all in all, git is a filesystem). Do you want to diff per file, per line, or what? – Piotr Migdal – 2015-02-15T19:53:37.380

@PiotrMigdal 1) You are probably right, though I am still curious how scientists without a background in data science handle that situation. 2) No doubt, git cannot handle that kind of data. A diff per file would be enough, just to see if my data has changed (or I changed it inadvertently). The point is to have control over, and documentation of, how my data changed, through what action, and by whom. – None – 2015-02-16T05:16:23.600

1@Johann Data scientists have different backgrounds. Mine is in quantum mechanics, for example. The whole point here is that: 1. StackExchange discourages so-called boat questions and 2. it's better to get best practices rather than how it was solved by people who had to solve it but had no idea. – Piotr Migdal – 2015-02-18T12:33:42.243

@PiotrMigdal That's a good point - thanks! – Johann – 2015-02-18T13:59:44.580

@Johann Are you sure you need to version control your dataset? If you are processing a big set of images, you usually want to version control the procedures that led to a modified set of images. Therefore, you usually don't need to keep track of the images themselves. If you want to restore this modified set in the future, you only need to take the original dataset and apply the procedures as recorded in some commit of your code. – Robert Smith – 2015-02-21T00:50:06.050

## Answers

5

What I am ending up using is a sort of hybrid solution:

• backup of the raw data
• git of the workflow
• manual snapshots of workflow + processed data, that are of relevance, e.g.:
• standard preprocessing
• really time-consuming
• for publication

I believe it is seldom sensible to keep a full revision history of large amounts of binary data, because the time required to review the changes will eventually become so overwhelming that it will not pay off in the long run. A semi-automatic snapshot procedure might help, ideally one that saves disk space by not replicating unchanged data across snapshots.
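One possible way to take snapshots without replicating unchanged data is to hard-link unchanged files to the previous snapshot. The following is only a minimal sketch of that idea, not a tool anyone in this thread uses: the function name `snapshot` and the directory layout are hypothetical, and it assumes all snapshots live on the same filesystem (a requirement for hard links).

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import Optional


def snapshot(source: Path, dest: Path, previous: Optional[Path] = None) -> None:
    """Copy `source` into `dest`, hard-linking files unchanged since `previous`.

    Hypothetical helper: `previous` is an earlier snapshot directory, or None
    for the first snapshot.
    """
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src_file.relative_to(source)
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        # Compare full contents (shallow=False) rather than trusting timestamps.
        if old is not None and old.is_file() and filecmp.cmp(src_file, old, shallow=False):
            os.link(old, target)  # unchanged: share storage with the old snapshot
        else:
            shutil.copy2(src_file, target)  # new or modified: store a fresh copy
```

Because linked files share storage, editing a file inside an old snapshot would also change it in later ones, so snapshot directories should be treated as read-only.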

@norok you've described a great general framework. I've implemented something similar in the DVC tool - please take a look at my answer below. I'd appreciate your feedback. – Dmitry Petrov – 2017-05-13T23:10:46.413

Well, I'm using `find . -type f -print0 | xargs -0 md5sum > checksums.md5` to calculate the checksums and `md5sum -c checksums.md5` to verify them, and I version control the checksums. That helps to check the data at different locations/on different machines. Seems to be the best we can do at the moment. – Johann – 2015-09-29T21:30:20.153

If modifying your data always changes its file name, then it might be a good solution. Otherwise, I would highly recommend checking the data itself, for example with rsync against (a copy of) the original data. One other possibility, which is common in neuroscience (although I do not like it so much because sometimes it is not as well documented as it should be), is to use the nipype Python package, which can be seen as a (sort of) workflow manager and which automatically manages the cache of binary data for the intermediate steps of the analysis. – norok2 – 2015-10-01T09:11:22.243
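The checksum manifest discussed in these comments can also be built and compared programmatically. Below is a rough Python sketch of the same idea; the function names (`file_md5`, `build_manifest`, `write_manifest`, `compare`) are made up for illustration, and the manifest file is assumed to be written outside the data directory so that it does not end up listing itself.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to `root`) to its MD5 checksum."""
    return {
        p.relative_to(root).as_posix(): file_md5(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def write_manifest(manifest: Dict[str, str], out_file: Path) -> None:
    """Write sorted 'checksum  path' lines, so git diffs stay stable."""
    lines = [f"{digest}  {name}\n" for name, digest in sorted(manifest.items())]
    out_file.write_text("".join(lines))


def compare(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Report which files were added, removed, or changed between manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Committing the manifest file to git after each run gives a per-file history of the data without storing the data itself in the repository.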

8

I have dealt with similar problems with very large synthetic biology datasets, where we have many, many GB of flow cytometry data spread across many, many thousands of files, and need to maintain them consistently between collaborating groups at (multiple) different institutions.

Typical version control like svn and git is not practical for this circumstance, because it's just not designed for this type of dataset. Instead, we have fallen back on "cloud storage" solutions, particularly DropBox and Bittorrent Sync. DropBox has the advantage that it does at least some primitive logging and version control and manages the servers for you, but the disadvantages are that it's a commercial service, you have to pay for large storage, and you're putting your unpublished data on commercial storage; you don't have to pay much, though, so it's a viable option. Bittorrent Sync has a very similar interface, but you run it yourself on your own storage servers and it doesn't have any version control. Both of them hurt my programmer soul, but they're the best solutions my collaborators and I have found so far.

There is a popular open source version of Dropbox, OwnCloud. I haven't tried it, though. – None – 2015-02-13T21:35:47.000

8

I have used Versioning on Amazon S3 buckets to manage 10-100GB in 10-100 files. Transfer can be slow, so it has helped to compress and transfer in parallel, or just run computations on EC2. The boto library provides a nice python interface.
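As an illustration of what this can look like in code, here is a small sketch written against boto3, the current successor of the boto library; using boto3 rather than the original boto is an assumption on my part, and the helper names, bucket, and keys are hypothetical. The client is passed in as a parameter, so the helpers themselves do not need credentials until they are called.

```python
from typing import Any, List, Tuple


def enable_versioning(s3: Any, bucket: str) -> None:
    """Turn on object versioning for a bucket (needs suitable permissions)."""
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def upload(s3: Any, bucket: str, local_path: str, key: str) -> None:
    """Upload a file; on a versioned bucket this adds a new version of `key`."""
    s3.upload_file(local_path, bucket, key)


def list_versions(s3: Any, bucket: str, key: str) -> List[Tuple[str, bool, int]]:
    """Return (version id, is latest, size in bytes) for each version of `key`.

    Only the first page of results is read; a key with very many versions
    would need pagination.
    """
    response = s3.list_object_versions(Bucket=bucket, Prefix=key)
    return [
        (v["VersionId"], v["IsLatest"], v["Size"])
        for v in response.get("Versions", [])
        if v["Key"] == key  # Prefix also matches longer keys, so filter exactly
    ]


# Example use (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   enable_versioning(s3, "my-lab-data")
#   upload(s3, "my-lab-data", "scan_001.tif", "raw/scan_001.tif")
#   print(list_versions(s3, "my-lab-data", "raw/scan_001.tif"))
```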

5

Try looking at Git Large File Storage (LFS). It is new, but might be the thing worth looking at.

As I see, a discussion on Hacker News mentions a few other ways to deal with large files:

5

We don't version control the actual data files. We wouldn't want to even if we stored it as CSV instead of in a binary form. As Riccardo M. said, we're not going to spend our time reviewing row-by-row changes on a 10M row data set.

Instead, along with the processing code, I version control the metadata:

• Modification date
• File size
• Row count
• Column names

This gives me enough information to know if a data file has changed and an idea of what has changed (e.g., rows added/deleted, new/renamed columns), without stressing the VCS.
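A metadata file of this kind could be generated along the following lines. This is only a sketch under assumptions: it treats the data as CSV files with a header row (binary formats would need their own reader), and the function names and the JSON layout are invented for illustration.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path


def describe_csv(path: Path) -> dict:
    """Collect modification date, file size, row count, and column names."""
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        columns = next(reader, [])  # assumes the first row is a header
        row_count = sum(1 for _ in reader)  # streams rows, never loads them all
    stat = path.stat()
    return {
        "file": path.name,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "size_bytes": stat.st_size,
        "row_count": row_count,
        "columns": columns,
    }


def write_metadata(data_dir: Path, out_file: Path) -> None:
    """Write one record per CSV file, sorted so that diffs stay readable."""
    records = [describe_csv(p) for p in sorted(data_dir.glob("*.csv"))]
    out_file.write_text(json.dumps(records, indent=2, sort_keys=True) + "\n")
```

The resulting JSON file is small enough to commit alongside the processing code, and a plain `git diff` then shows added rows or renamed columns.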

2

I haven't used them, but there was a similar discussion in a finance group.

Data repository software suggestions: SciDB, ZFS, and http://www.urbackup.org/

1

This is a pretty common problem. I had this pain when I did research projects for a university and now - in industrial data science projects.

I've created and recently released an open source tool to solve this problem - https://dataversioncontrol.com or DVC.

It basically combines your code in Git with data on your local disk or in the cloud (S3 and GCP storage). DVC tracks the dependencies between data and code and builds the dependency graph (DAG). It helps you make your project reproducible.

A DVC project can be shared easily: sync your data to a cloud (the `dvc sync` command), share your Git repository, and provide access to your data bucket in the cloud.

"learnable in a few hours" is a good point. You should not have any issues with DVC if you are familiar with Git. You really need to learn only four commands:

1. `dvc init` - like `git init`. Should be done in an existing Git repository.
2. `dvc import` - import your data files (sources) from a local file or a URL.
3. `dvc run` - run the steps of your workflow, like `dvc run python mycode.py data/input.jpg data/output.csv`. DVC derives the dependencies between your steps automatically, builds the DAG, and keeps it in Git.
4. `dvc repro` - reproduce your data file. Example: `vi mycode.py` to change the code, and then `dvc repro data/output.csv` will reproduce the file (and all its dependencies).
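To make the dependency-graph idea concrete, here is a toy Python sketch of how a tool could decide which outputs need to be regenerated after a change. It is not DVC's actual implementation; the function `stale_outputs` and the example file names beyond those in the list above are hypothetical.

```python
from typing import Dict, List, Set


def stale_outputs(dependencies: Dict[str, List[str]], changed: Set[str]) -> List[str]:
    """Return every output that depends, directly or indirectly, on a changed file.

    `dependencies` maps each output file to the files it is built from.
    """
    stale: Set[str] = set()
    grew = True
    while grew:  # repeat until no new stale outputs are discovered
        grew = False
        for output, inputs in dependencies.items():
            if output not in stale and any(i in changed or i in stale for i in inputs):
                stale.add(output)
                grew = True
    return sorted(stale)


# Example graph: a processing step followed by a (hypothetical) plotting step.
graph = {
    "data/output.csv": ["mycode.py", "data/input.jpg"],
    "figures/plot.png": ["plot.py", "data/output.csv"],
}
print(stale_outputs(graph, {"mycode.py"}))
```

Editing `mycode.py` marks both `data/output.csv` and the plot built from it as stale, while editing only `plot.py` would leave `data/output.csv` untouched.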

You need to learn a couple more DVC commands to share data through the cloud and basic S3 or GCP skills.

The DVC tutorial is the best starting point - "Data Version Control: iterative machine learning"

0

You may take a look at my project called DOT: Distributed Object Tracker repository manager.
It is a very simple VCS for binary files for personal use (no collaboration).
It uses SHA1 for checksumming and block deduplication. Full P2P syncing.
One unique feature: an ad hoc, one-time TCP server for pull/push.
It can also use SSH for transport.

It is not yet released, but might be a good starting point.
http://borg.uu3.net/cgit/cgit.cgi/dot/about/