I am a PhD student in Geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know svn and git fairly well, and I have come to value a project history, combined with the ability to easily work together and have protection against disk corruption. I also find git extremely helpful for keeping consistent backups, but I know that git cannot handle large amounts of binary data efficiently.
In my master's studies I worked on data sets of similar size (also images) and had a lot of problems keeping track of different versions on different servers/devices. Diffing 100 GB over the network really isn't fun, and it cost me a lot of time and effort.
I know that others in science seem to have similar problems, yet I couldn't find a good solution.
I want to use the storage facilities of my institute, so I need something that can use a "dumb" server. I would also like to have an additional backup on a portable hard disk, because I want to avoid transferring hundreds of GB over the network wherever possible. So, I need a tool that can handle more than one remote location.
Lastly, I really need something that other researchers can use, so it does not need to be super simple, but it should be learnable in a few hours.
I have evaluated a lot of different solutions, but none seem to fit the bill:
- svn is somewhat inefficient and needs a smart server
- hg bigfile/largefile can only use one remote
- git bigfile/media can likewise use only one remote, and is not very efficient either
- attic doesn't seem to have a log, or diffing capabilities
- bup looks really good, but needs a "smart" server to work
I've tried git-annex, which does everything I need it to do (and much more), but it is very difficult to use and not well documented. I've used it for several days and couldn't get my head around it, so I doubt any of my coworkers would be interested.
How do researchers deal with large datasets, and what are other research groups using?
To be clear, I am primarily interested in how other researchers deal with this situation, not just this specific dataset. It seems to me that almost everyone should have this problem, yet I don't know anyone who has solved it. Should I just keep a backup of the original data and forget all this version control stuff? Is that what everyone else is doing?
If you tell me three things, I might have an answer! 1. Does your medium-sized data set get bigger? If so, how often? 2. Do you use programming languages/frameworks to do pattern matching and data analysis? 3. Does anyone else use this data? If so, do they change it? – None – 2015-02-13T10:40:38.030
I voted to "Leave Open" when reviewing Close Votes queue because I don't think this question is for specific situation. However, can anybody who is familiar with Software Recommendations SE tell us this question would fit into that site? – None – 2015-02-13T10:52:42.327
@scaaahu I don't think this is necessarily a software question; an acceptable answer could also describe a workflow or combination of tools and systems. (Anyways, being on topic somewhere else shouldn't play into the decision to close a question here.) – None – 2015-02-13T11:01:58.770
Just to protect against data corruption with image data, I periodically run a script that re-computes a checksum file with all files and their md5 checksums. The checksum file is then kept in git. Now I can immediately see with git diff if any of the checksums have changed. And I can also see which files have been removed & added. And if there are e.g. any signs of data corruption, then I can use the regular backups to restore old versions. Not perfect but better than nothing. – None – 2015-02-13T11:30:36.463
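A minimal sketch of that checksum workflow, under assumptions that are not in the comment above: the images live under a data/ directory, the checksum file is called checksums.md5, and only that file (not the images) is tracked in git.

```python
#!/usr/bin/env python3
"""Sketch of the checksum-file-in-git idea: hash every data file, write the
results to one small text file, and let git diff on that file reveal any
changed, added, or removed data files."""

import hashlib
import os

DATA_DIR = "data"                # hypothetical location of the image files
CHECKSUM_FILE = "checksums.md5"  # the only file tracked in git

def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_checksums():
    """Walk DATA_DIR and write one 'digest  relative/path' line per file."""
    lines = []
    for root, _, files in os.walk(DATA_DIR):
        for name in files:
            path = os.path.join(root, name)
            lines.append(f"{md5sum(path)}  {os.path.relpath(path, DATA_DIR)}")
    with open(CHECKSUM_FILE, "w") as f:
        f.write("\n".join(sorted(lines)) + "\n")

if __name__ == "__main__":
    write_checksums()
    # Then:  git add checksums.md5 && git commit -m "update checksums"
    # A later 'git diff checksums.md5' shows exactly which files changed.
```

Sorting the lines keeps the checksum file stable across runs, so the diff only shows genuine changes.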
@JukkaSuomela I think you should post that as an answer, not a comment. – None – 2015-02-13T11:48:49.980
@Johann Why not just files, without version control (but with backups)? – Piotr Migdal – 2015-02-13T12:22:54.293
@PiotrMigdal: Are you seriously asking why people should use version control, instead of just having a bunch of files with backups? ;-) – None – 2015-02-13T13:01:50.247
@JukkaSuomela I think it's a reasonable question when you've got very large datasets, if those datasets change frequently... in those cases, backup often is what's used as version control. – None – 2015-02-13T13:29:18.427
I'm voting to close this question as off-topic because it deals with data/databases rather than something specific to academia. The question is great, and (IMHO) should be moved to DataScience.SE or (perhaps) Databases.SE. – Piotr Migdal – 2015-02-13T14:02:35.853
@PiotrMigdal I don't know if this is off-topic, but this question also does not fit well on Databases.SE or DataScience.SE. I would like to know what other researchers/institutes do in practice to deal with this kind of problem - I've updated the question accordingly. – None – 2015-02-15T19:22:05.807
@DaveRose 1. Yes, I will hopefully add more experimental data and processed images, but not very often (maybe a few iterations); 2. Yes, that is part of my thesis; 3. Yes, others will hopefully use and change the data. – None – 2015-02-15T19:32:18.047
@JukkaSuomela That actually sounds pretty good (at least much better than anything I've found so far). – None – 2015-02-15T19:33:45.380
@PiotrMigdal Going without version control kind of pains me for the reasons I've stated in the question (especially: "did my data change without me noticing?" and then 2 days of diffing by hand). So I am looking for something smarter. – None – 2015-02-15T19:35:56.057
@Johann 1. How you store or version control it is in the domain of data science (regardless of whether you use it in academia, industry, or for a hobby project). I really want to ensure the best answers, and it is good to go where there are many experts in data. 2. My point was only that it might not be a job for git. (And, all in all, git is a filesystem.) Do you want to diff per file, per line, or what? – Piotr Migdal – 2015-02-15T19:53:37.380
@PiotrMigdal 1) You are probably right, though I am still curious how scientists - without a background in data science - handle that situation; 2) No doubt, git cannot handle that kind of data. A diff per file would be enough, just to see if my data has changed (or if I changed it inadvertently). The point is to have control/documentation over how my data changed, through what action, and by whom. – None – 2015-02-16T05:16:23.600
@Johann Data scientists have different backgrounds. Mine is in quantum mechanics, for example. The whole point here is that: 1. StackExchange discourages so-called boat questions and 2. it's better to get best practices than how it was solved by people who had to solve it but had no idea. – Piotr Migdal – 2015-02-18T12:33:42.243
@PiotrMigdal That's a good point - thanks! – Johann – 2015-02-18T13:59:44.580
@Johann Are you sure you need to version control your dataset? If you are processing a big set of images, you usually want to version control the procedures that led to a modified set of images. Therefore, you usually don't need to keep track of the images themselves. If you want to restore this modified set in the future, you only need to take the original dataset and apply the procedures you did according to some commit in your code. – Robert Smith – 2015-02-21T00:50:06.050
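A minimal sketch of that idea, with hypothetical names throughout (the raw/ and processed/ directories, the PARAMS dict, and the placeholder transform are illustrative, not from the comment): the raw images stay read-only, only the script and its parameters go into git, and checking out any past commit and re-running it regenerates the corresponding processed set.

```python
#!/usr/bin/env python3
"""Sketch of 'version the procedure, not the data': derived images are
reproducible from the untouched originals plus whatever version of this
script (and its parameters) is checked out in git."""

from pathlib import Path

RAW_DIR = Path("raw")        # original data: read-only, backed up, not in git
OUT_DIR = Path("processed")  # derived data: regenerable, so not versioned

# Changing and committing these parameters is what gets versioned.
PARAMS = {"invert": True}

def process(data: bytes, params: dict) -> bytes:
    """Placeholder transform; a real pipeline would call its image tools here."""
    return bytes(255 - b for b in data) if params["invert"] else data

def main():
    OUT_DIR.mkdir(exist_ok=True)
    for raw_file in sorted(RAW_DIR.glob("*")):
        if raw_file.is_file():
            (OUT_DIR / raw_file.name).write_bytes(process(raw_file.read_bytes(), PARAMS))

if __name__ == "__main__":
    main()
```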