Should I share my horrible software?

257

43

After I had published my paper, some people asked me to share the software that I developed. At first, I was very happy that my paper attracted some attention, and I was happy to share not only the binary but also the source code, case studies etc. But looking at my software, I feel very embarrassed.

My software is just horrible: the source code is a mess, containing several of my unsuccessful attempts; I have never used design patterns, so duplicate code is everywhere; for simplicity and quick implementation, I often prefer recursion to loops, etc.

I'm always under pressure to produce new results, and cleaning up that code would cost me significant effort.

My question is: will sharing this horrible software give people a very negative impression of me? Would it harm my career if the people I share it with are prospective collaborators or employers, given that they work in the same field?

qsp

Posted 2015-01-22T21:21:47.963

Reputation: 9 144

Is it possible to work with an undergrad or graduate CS student and have them 'clean up' some of the code? – J. Roibal – 2016-06-22T16:15:26.503

211Sounds like academic software. – Dave Clarke – 2015-01-22T21:30:53.047

I'm not sure how much I can really help you, but I have seen this video, which might encourage you a little given your problem. https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=0CCsQtwIwAg&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0SARbwvhupQ&ei=k2rBVKP8J8P1OPbMgagL&usg=AFQjCNEIXszWW5AeYeh5TglmX2_yHFD7WA&bvm=bv.83829542,d.ZWU Sorry I couldn't help you more; I just wanted to show you this video, which I thought was relevant.

– StephRus – 2015-01-22T21:26:57.413

62Dirty secret: most academic software is horrible. Even the stuff coming out of the Computer Science department. – Mark – 2015-01-22T23:33:22.977

66

You could release it under the CRAP license: http://matt.might.net/articles/crapl/

– mankoff – 2015-01-23T00:04:11.380

1Research generally involves trying a thousand things that don't work. If you manage to write code that works in the first few tries, you're probably not doing research but just implementing. – Lie Ryan – 2015-01-23T00:37:47.130

This paper on SIAM journal is a fascinating read and makes a very compelling argument. – Federico Poloni – 2015-01-23T08:14:45.857

24Another point to consider: What if some of your conclusions are based on false data originating from a bug in your software? Readers should be able to check for that. – Philipp – 2015-01-23T12:34:11.643

2

Posting your code in a public place such as GitHub gives you a chance to show how you have incrementally improved your software. Significantly improving the software without changing the results generated from the software is not an easy task and therefore is a skill that is highly valued. You might find posting some of the code to the code review site https://codereview.stackexchange.com/ to be of help here.

– shuttle87 – 2015-01-23T14:59:25.247

@qsphan Just out of curiosity, what's your paper about? – Quazi Irfan – 2015-01-23T19:56:12.600

1

To add to the conversation: http://www.phdcomics.com/comics/archive.php?comicid=1692

– Ander Biguri – 2015-01-23T21:49:03.660

3@mankoff: Please don't. That license is abysmal. Releasing it under 3-Clause BSD with a big sticker on it warning about the quality would be a much better option. – Bobby – 2015-01-24T12:57:56.767

a lot of non-academic software is a mess too, cf. recent bash bugs... – Dima Pasechnik – 2015-01-24T19:11:13.773

2Blah, as a programmer: "I'm always under pressure to produce new result, and cleaning those code will cost me significant effort." Do you realize the reason you're under so much pressure is because you spend so much time DEBUGGING the code that you never bothered to write or maintain correctly? – None – 2015-01-24T21:38:21.320

Even if the code is truly awful - which I doubt - it is still battle tested and debugged which is much, MUCH more valuable than using the latest design patterns etc. Be sure to add detailed instructions on the underlying platform (Ubuntu 12.04, OS X 10.7 with XCode with libX version Y etc) as there might be subtle differences giving problems, as well as full instructions on how to compile and link your programs. You probably already automated somewhat when writing it - just jot it down so others can see. – Thorbjørn Ravn Andersen – 2015-01-25T02:21:15.550
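One low-effort way to follow the advice about documenting the platform is to let the code write down its own environment. This is only a minimal sketch using the Python standard library; the function names and the file name `ENVIRONMENT.json` are made up for illustration, and library versions (the "libX version Y" part) would still need to be added by hand or by a package manager.

```python
import json
import platform
import sys


def describe_environment():
    """Collect basic facts about the machine and interpreter running the code."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
    }


def write_environment(path="ENVIRONMENT.json"):
    """Save the description next to the results (hypothetical file name)."""
    with open(path, "w") as handle:
        json.dump(describe_environment(), handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Print the description so it can be pasted into a README.
    print(json.dumps(describe_environment(), indent=2, sort_keys=True))
```

Shipping the resulting file alongside the source gives readers a starting point when a build or run behaves differently on their machine.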

Nothing wrong with recursion unless the programming language doesn't support optimizing it – michel-slm – 2015-01-26T04:12:50.970

@Philipp: What worries me more is a situation in which the paper is correct, but the code has a nasty bug in it. In that case, fixing the bug would change the correctness of the answer but not (say) its performance characteristics... so if you release the code, then you risk embarrassing yourself even when your results might not be invalidated. (I've found these kinds of bugs in my own code before.) – Mehrdad – 2015-01-27T07:27:12.987

It doesn't matter if it is academic or not. http://blog.codinghorror.com/version-1-sucks-but-ship-it-anyway/

– anatoly techtonik – 2015-01-27T16:40:59.207

1@Mark: Most software is horrible, full stop. Not limited to academic, and not a secret really. ;-) – DevSolar – 2015-01-29T14:38:16.097

Just to add to this, even scientific software that is used every single day in major applications gets messy and horrible - the code used by the Met Office is still in Fortran and has bits left in all over the place from several decades ago... – Kvothe – 2015-01-31T16:43:56.593

2@djechlin While occasionally true, "The source of your pressure is your bad coding" is not always generalizable. Unless there was something about my Python skills that was driving the latest Ebola epidemic? – Fomite – 2015-04-22T16:49:27.043

@Fomite something about your Python skills is why you haven't finished your Ebola research yet, if you want to put it that way :P – None – 2015-04-22T18:37:32.740

1@djechlin Nah, I think it's likely more to do with the people still dying... – Fomite – 2015-04-22T19:39:25.323

@Fomite why bother writing code at all then if there are people dying either way? – None – 2015-04-22T20:36:28.600

@Fomite my point being if you're going to do it you should do it well and fast if it's the major bottleneck on your work. I suppose you could say the major bottleneck on your work is the prevalence of ebola, but after that, there is a frighteningly good chance it's how well your code is working. But if you want to insist the existence of ebola is the problem then more power to you. – None – 2015-04-22T20:37:42.027

@djechlin I'm just noting that the idea that the source of pressure on an academic boils down to bad coding is both profoundly presumptuous, and definitely not generalizable. – Fomite – 2015-04-22T21:29:20.727

Answers

271

Yes, you should.

First, most scientific software is terrible. I'd be very surprised if yours is worse than average: the mere fact you know design patterns and the difference between recursion and loops suggests it's better.

Second, it's unlikely you'll have the incentive or motivation to make it better unless, or until, it's needed by someone else (or you in 6 months). Making it open gives you that incentive.

Potential upsides: possible new collaborators, bugfixes, extensions, publications.

Potential downsides: timesink (maintaining code or fixing problems for other people), getting scooped. I'll be clear: I don't take either of these downsides very seriously.

Ian

Posted 2015-01-22T21:21:47.963

Reputation: 5 612

I've fooled around with GATT (a program for timetabling using genetic algorithms, the testbed for a series of PhDs in Scotland). It worked, but the code contained lots of commented out code, unused functions (from previous theses), over-general code (to accommodate different attempts) and very narrow, hard to adapt code side by side. Compound that with convoluted data structures, which are sometimes kludged into non-designed uses. Cleaning that mess up would have been a huge task. But it was fine as it was for its framework for exploration rôle. – vonbrand – 2016-02-28T13:39:29.030

38Most scientific software is NOT terrible. (I've been making a living working with & optimizing it for a couple of decades now, so I think I have an informed opinion.) It just has rather different criteria for goodness: working, getting correct answers in practical time, being extensible to the next theory, etc, rather than conforming to the latest quasi-religious design paradigm or language trend. And for the OP, your "horrible" software might, if cleaned up & commented a bit, be more accessible to other scientists than "good" code. – jamesqf – 2015-01-23T04:19:40.487

8I'd just add to this that I'd suggest sharing it via Github or similar, which has a vibrant community and makes it easier for people to collaborate, fork your project and contribute to it. – mmalmeida – 2015-01-23T09:29:24.823

54No mention of the elephant in the room, reproducibility? – Iwillnotexist Idonotexist – 2015-01-23T10:26:30.460

33@jamesqf : I exaggerated for effect, but my thinking is this. Most scientific software is short proof-of-principle, throwaway code. It rarely gets released outside of a small group and is, in my experience, poorly written. Most scientific software that lasts at a reasonable scale is not terrible (I've also worked with some on a similar timescale): it can't be to produce multiple publications. But I'm thinking about "all code written by scientists" here. – Ian – 2015-01-23T10:43:48.440

@mmalmeida I agree, Github is a conduit for collaboration. The issues & milestones offer a way to track problems & enhancement requests. Before open sourcing this code, I would also recommend taking some time to clean it or at minimum add descriptive comments to it. – techmsi – 2015-01-23T14:43:49.170

15@jamesqf Well, most is terrible. I would also say I have an informed opinion. How code works is not a criterion for "goodness" of code for a programmer, at least the one considered a good one. Correctness is a "must" property. Divergence rate, numerical stability, speed (excluding hardware specific optimizations) etc. are properties of the algorithm the code implements. You can have good code and bad code performing the same. Although I do admit "good" is too vague and often results in misunderstanding. I would say "code quality" is very often terrible. Even if the program works great. – luk32 – 2015-01-23T17:19:38.777

1@jamesqf: most scientific software I've come across was not in any kind of source control, and this sounds like another example (or why else are all failed attempts still in there). – RemcoGerlich – 2015-01-23T21:18:23.813

@RemcoGerlich: Well, there's an example of having different criteria for 'goodness'. I've never yet found a source code control system that doesn't have an obvious major design flaw: checking files out of the system resets their timestamp to the checkout time, not last modification time. The implementers of these systems obviously think that is a 'good' feature - I've had several user forum arguments about it - but the end result is that I don't use version control unless I'm explicitly paid to do so. – jamesqf – 2015-01-24T00:03:48.177

28@jamesqf: don't take it personally, but that ranks pretty high on the list of dumb reasons not to use source control (besides, with "normal" VCSs it takes a script of maybe 20 lines to "fix" this "major design flaw"). – Matteo Italia – 2015-01-24T01:50:25.850

18@jamesqf I have never seen the file timestamp matter, at all. The only case I can even think that it affects anything is make, and when you have a clean checkout, there's no binaries to check if they need rebuilding. The benefits of source control are obvious: tracking changes (which is incredibly valuable in practice), protection against loss, easier sharing with colleagues, and enabling people to work on the same code base simultaneously. Your problem seems extremely obscure, and I would wager that those benefits alone outweigh the downsides of your particular use case – jpmc26 – 2015-01-24T07:52:35.190

The Sage project (http://www.sagemath.org) is a living example of software of math research that is not horrible, though there are constantly design discussions and complaints about coding practice.

– Christian – 2015-01-24T08:57:26.670

@jpmc26: How can the file times not matter? Perhaps if you only work on one set of code, they aren't that important, but if you have many different codes, each with many files (typical of scientific software, in my experience), it's really useful to be able to see at a glance (by doing e.g. 'll *.[chyl]') exactly which files you worked on when. Now I do agree that a CORRECTLY WORKING source code control system would be a great tool to have. The problem is that the !@#$s that build them absolutely refuse to implement this really simple fix. – jamesqf – 2015-01-24T18:52:22.387

23@jamesqf Commits (the core of most source control systems) have timestamps. In git for example, create a commit for each modification of your data files, for instance "reran simulation using new code from revision XXX". For each modification of code, make a commit that says "improved code in XXX for reason YYY". Then, instead of having a "last modification date" for your files, you get a nice list of commits, along with when they occurred, exactly what files were added/modified/deleted, and a helpful comment. There is no fix to be done, you simply don't know how to use source control properly. – Thomas – 2015-01-25T08:23:58.107

11Not to mention that if you really need this feature, any source control software lets you ask it when a file was last modified. In fact, most will go beyond that and tell you which commit last affected a file, giving you the context of the change in addition to the timestamp, which is way more helpful and practical than "oh, I last modified this file 26 days ago, I better look in my .txt file see what I did there... oh, wait, I forgot to write it down, oops". – Thomas – 2015-01-25T08:25:18.990
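To make the point about asking the version control system concrete, here is a rough sketch that assumes git is the system in use and that the `git` executable is on the PATH. The helper names are hypothetical; `%cI` (committer date, ISO 8601) and `%s` (commit subject) are standard git pretty-format placeholders.

```python
import subprocess


def last_commit_command(path):
    """Build the git command that prints the date and subject of the
    most recent commit touching `path`."""
    return ["git", "log", "-1", "--format=%cI %s", "--", path]


def last_commit_info(path, repo_dir="."):
    """Run the command inside `repo_dir`.

    Returns an empty string if no commit touched the file, and raises
    subprocess.CalledProcessError if git fails (e.g. not a repository).
    """
    result = subprocess.run(
        last_commit_command(path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `last_commit_info("solver.c")` inside a repository would return the commit date together with the commit message, which carries more context than a bare file timestamp (`solver.c` is just a placeholder name).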

14@jamesqf Thomas nailed it. You look at the commit history to see what was changed, when, and how. With a diffing tool, you can even see a line by line comparison of the previous and current versions of the files (or even two previous versions). Your source control becomes the authoritative source for this information. This also makes it much easier to share the history with others, something I think would be valuable to the academic community. On something of a side note, you should usually put unrelated code bases in different repositories. This allows you to view the histories separately. – jpmc26 – 2015-01-25T09:47:00.480

@Thomas: Sorry, but this is the same line of BS that I get from the developers: either work the way we want you to work, or get lost. Sorry, but I have TRIED working your way, folks. It does not work for me. – jamesqf – 2015-01-26T06:27:45.573

1@jamesqf In that case, it'd be pretty easy to write a hook for most VCS systems to make it work the way you want. No reason not to have it both ways. – mattdm – 2015-01-26T13:47:27.223

9The reason the timestamps are reset is because if you get a changed file in your checkout that was changed yesterday, but you did a build this morning, most build systems would not pick up the change (because the timestamp was older than the build artifacts). With the timestamp set to time of checkout, the builds will work. – Alan Shutko – 2015-01-26T15:12:43.273

7@jamesqf Like it or not, using VCS is a pragmatic obligation. If you don't use VCS, many potential collaborators (including almost all experienced programmers) will find it very difficult to take your project seriously. And the impetus for those "quasi-religious" design paradigms is the pursuit of an ineffable elegance that makes the project extensible to the next theory and accessible to others. – Kevin Krumwiede – 2015-01-26T15:23:29.573

14@jamesqf There's a reason 'the developers' keep telling you that - it's because they're right. Also, yes, most scientific software is indeed pretty bad, including my own. I write code much differently when I'm writing it for a production project that I intend to actually maintain versus when I'm writing it as a proof-of-concept to run some simulations for a research paper. I've seen the same from nearly every piece of research code I've ever seen. There are a few exceptions, but they're just that: exceptions. – reirab – 2015-01-26T21:04:36.593

9@jamesqf "How can the file times not matter? Perhaps if you only work on one set of code, they aren't that important, but if you have many different codes, each with many files (typical of scientific software, in my experience), it's really useful to be able to see at a glance exactly which files you worked on when." Uh... isn't that exactly the problem version control is meant to solve? – Ajedi32 – 2015-01-26T21:07:46.797

2@jamesqf If using source control actively causes you problems, you probably just need to rethink your workflow a bit. If you can narrow down specific problems you're having, questions about how you can approach them differently would most likely be welcome on Programmers and StackOverflow. Give it a shot. You may be surprised at how much it improves your experience. I'll also say this: if you tried git, that system is really hard to learn. You might have an easier time getting started with SVN, which is far more intuitive and isn't as demanding on your workflow. SVN is far better than nothing. – jpmc26 – 2015-01-26T22:15:57.147

3As someone who works in software, no code is ever perfect. But don't let perfection be the enemy of “good enough”. As others have said, someone may well find your software useful. If you can take the time to clean it up a little and add some comments or a README so much the better :) – Owen Blacker – 2015-01-27T13:03:41.793

1It's not just scientific software that's terribly written. I've been developing in software shops and manufacturing firms alike for ten years now, and I promise you the code isn't getting any cleaner. Anywhere. Writing beautiful code is very hard. It's like writing a book. The first draft usually sucks, but it will actually tell the story. To sell a book you have to make it beautiful. To please your stakeholders in the plants, they just need it to tell the story. – corsiKa – 2015-01-28T16:17:46.740

1This comment thread has turned into an off-topic conversation about VCS. Please take extended discussion to [chat]. – eykanal – 2015-01-29T03:06:25.030

72

I would clean it up a little and share it. I've released a lot of code over the years, and also not released code for the reasons you give.

Go through it and comment it, at whatever level you can. Leave in "failed attempts" and comment them as such. Say why they failed, and what you tried. This is VERY useful info for people coming after you.

Make a README file that says you are releasing it on request in the hope it helps someone. Say that you know the code is ugly, but you hope it's useful anyway.

Far too many people hold things back because it isn't perfect!

Blair MacIntyre

Posted 2015-01-22T21:21:47.963

Reputation: 2 306

Sharing failed attempts is different from mixing broken code into the working code, which doesn't accomplish anything except confuse and prevent understanding. – Matthew Read – 2015-09-24T04:03:21.647

35I can't endorse leaving in failed attempts. You should use version control instead. Including brief comments that explain why the initial attempt failed is fine, but including the actual failed code may be actively harmful. – David Z – 2015-01-23T09:57:31.797

5@DavidZ On the contrary, show it. If you use version control, people can see previous variations of your work, which is far from being useless. But if like here you don't use VC, then don't remove failed attempts. Put them in another file with appropriate comments. How harmful could it be? – coredump – 2015-01-23T10:51:03.013

25@coredump It can make the entire program virtually incomprehensible. I've seen this happen. If you don't use VC, start using it. The only way I could support a recommendation not to remove failed attempts is if you're forbidden from putting the code in VC for some reason which I can't imagine, and it's essential to see the previous code in order to understand what the current code does (which probably means the current code is bad too, though I admit that exceptions may exist). – David Z – 2015-01-23T10:56:13.913

3@DavidZ Sorry, but "Say why they failed, and what you tried" is good advice, IMHO. If your code is messy and/or if you are not accustomed to software engineering practices, please leave it as-is and comment as much as you can. Removing useful information could make things virtually incomprehensible. I've seen this happen ;-). Okay, so maybe there is a middle-ground between showing all the horrible things that were attempted and leaving useful traces. I think "I would clean it up a little" is also good advice. – coredump – 2015-01-23T13:21:19.377

1@coredump Saying what you tried in the sense of including failed code is good advice when you're asking for help, but not for writing the program itself. – David Z – 2015-01-23T18:36:01.830

1I don't see how failed code can help, except by obfuscating the working code (unless the errors are clear, e.g. "much simpler code, but works only for positive entries"). But it is good to make comments (e.g. "it would be tempting to use arrays, but they won't work for non-unique entries"). – Piotr Migdal – 2015-01-25T16:01:15.190

5If the code is an implementation of a research paper, and he tried to implement it different ways, my assumption is that the solution is non-obvious. In research, someone else might see the working code and think "I could do this better this way" which might be one of the ways the author tried first. We do poorly in CS at sharing our failures, which leads to lost work sometimes. That's my point. Whether it's a good choice in this case can't be known without seeing the code, but I know plenty of other profs who share this view – Blair MacIntyre – 2015-01-25T16:08:11.877

51

Yes! Especially if your paper is e.g. about a new/improved algorithm that you've implemented, or you do significant non-standard data analysis, or basically anything where reproducing your results means re-implementing your software.

Papers seldom have room to give more than an outline. I know I've spent (= wasted) much too much time trying to implement algorithms from papers that left out critical (but not strictly relevant to the paper) details.

jamesqf

Posted 2015-01-22T21:21:47.963

Reputation: 1 076

7Very true comment about reproducibility, particularly, the second para. – Faheem Mitha – 2015-01-23T10:50:40.040

@E.P: Yes. I'm sorry, it's my dystypica cropping up again :-) – jamesqf – 2015-01-24T18:56:25.863

44

You think your code is messy? I have seen (and attempted to work with) code that gave me nightmares:

  • Five levels of if True nested, scattered at random places through the code.
  • Create an array of zeroes, convert it to degrees, take the cosine, and back to radians. Then, throw away the result.
  • In software under heavy development, the list of "supported architectures" is so ancient (as the developers themselves admit) that it would be difficult to get your hands on one of these computers nowadays.
  • Features broken or modified several versions ago, still recommended in the docs.
  • Code that goes from using a standard format input to some format of their own. How to generate it? No one really knows, and the developers handwave a response.
  • Releases that don't even compile. (Did you even test it?)
  • GUI menus that you have to access in a specific order. Otherwise, you get a segmentation fault and have to start from the beginning.
  • Hard-coded paths scattered through the code. So you have to sift through several files finding and changing all the occurrences of /home/someguy/absurd_project/working/ to yours.

And, my personal favourite: a certain program of thousands of lines of code that used comments only to disable random bits of code, except for one:

Here we punch the cards.

Still, no idea what it was doing.

And that is leaving aside the classical good-practice issues, like one-letter variables all over the code, algorithms not specified anywhere...

If you are concerned about the quality of your code, it probably means you care enough to have made it better than the average. If you wait until the code is clean, it may never get out, and your scientific contributions will be partially lost.

In my opinion, the important things that you should care about, in order, are:

  1. Input and output formats. Use standards when available, make it simple when not. Make using your program as a black box easy.
  2. Commented. Brief descriptions of the functions, quick overview of the algorithm.
  3. Legibility. Using idiomatic code, good variable names...
  4. Structure. This is easier when you know what you want to do, which is usually not the case in research code. Consider refactoring it only if there is interest in the community.

So, release your software whenever you have 1 (2 and part of 3 should come in as you are writing it).
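The first priority in the list above (standard input and output formats, so the program can be used as a black box) could look roughly like this. It is only a sketch: the column names `id`, `a`, `b` and the function `run_model` are placeholders standing in for the real research code, and CSV in, JSON out is just one reasonable choice of standard formats.

```python
import csv
import json


def run_model(rows):
    """Placeholder for the actual research code: here it just adds two columns."""
    return [
        {"id": row["id"], "total": float(row["a"]) + float(row["b"])}
        for row in rows
    ]


def process(in_stream, out_stream):
    """Read CSV with columns id, a, b from in_stream and write JSON to out_stream.

    Working on streams rather than file names keeps hard-coded paths
    out of the code entirely.
    """
    rows = list(csv.DictReader(in_stream))
    json.dump(run_model(rows), out_stream, indent=2)
```

Calling `process(sys.stdin, sys.stdout)` from a small main function would turn this into a command-line filter that others can run without reading the internals.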

Davidmh

Posted 2015-01-22T21:21:47.963

Reputation: 19 365

4+1 but I would also add appropriate error handling to the list of important points (all too often missing from rushed research code). In particular, take special care with any error that could affect the output silently - is that function return value the real number zero or a default-return-on-error zero? (Don't plot those!) Also, errors should be handled, but not over-handled. I've seen naïvely written "bulletproof" code that could silently recover from garbled input data and go on producing output without complaint. A crash may be frustrating, but inaccurate results can be a disaster. – DeveloperInDevelopment – 2015-01-24T11:23:45.113

6Re your point #3, and someone else's comment about single-letter variable names: in scientific software, you are often more-or-less directly translating math equations to code. If the variables in the equations are single letters, it makes perfect sense to use them as variable names in the code. And, as I admit I should do more often, include LaTeX for the equations in a comment. For the rest, you haven't really lived until you've tried to debug FORTRAN 66 with computed & assigned GOTOs :-) – jamesqf – 2015-01-24T19:05:00.410

+1 for answer from @imsotiredicantsleep. Code that silently fails is difficult to work with. If it's going to generate inaccurate results, make sure that it generates a warning or throws an error instead. – Contango – 2015-01-26T13:02:15.103
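The difference between a "default-return-on-error zero" and a real zero, as raised in the comments above, can be illustrated with a small sketch. Both functions are invented for this example; the second one is one possible way to fail loudly, not the only one.

```python
import math


def mean_silent(values):
    """Anti-pattern: returns 0.0 on bad input, which is
    indistinguishable from a genuine mean of zero."""
    try:
        return sum(values) / len(values)
    except (ZeroDivisionError, TypeError):
        return 0.0


def mean_strict(values):
    """Raises an error instead of inventing a number."""
    if len(values) == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    total = 0.0
    for value in values:
        # Reject non-numeric entries and NaN instead of letting them through.
        if not isinstance(value, (int, float)) or math.isnan(value):
            raise ValueError(f"invalid measurement: {value!r}")
        total += value
    return total / len(values)
```

With `mean_silent`, an empty data file and a data set that truly averages to zero produce the same point on a plot; with `mean_strict`, the empty file stops the run.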

20

You're asking whether sharing low-quality software would give a bad impression of you. I think that sharing software at all gives a good impression.

  1. As a computer scientist, I like when colleagues make their source code available. It makes me more likely to look deeper into their work, maybe contact them, maybe cite them, because there is one more artifact to interact with (not just the paper, but also the code).

  2. When a paper reports a result that is "proven" by source code, but the source code is not public, I'm often wondering whether the result is real. Looking at the source code (or just the availability of the source code, without ever looking at it) can convince me.

So sharing your source code, horrible or not, would always give me a good impression of you.

Now, if you want to impress even more, it would help ...

... if you react to issues or pull requests on a site like github, that is, when I see that others try to contact you and you react.

... if your code contains a readme file which relates the claims from your paper to the source code. This way, when I read the paper and want to know more, I can use the readme to jump to the appropriate place in the code. Typical phrases from such a readme could be: "The algorithm from Sec. 3.2 of the paper is in file algorithm/newversion/related/secondtry/foo.c" or "To repeat the run with the small dataset described in Sec. 2 of the paper, run 'make; make second_step; foo_bar_2 datasets/christmas.dataset'. This run takes about 2 days on my laptop".

You might also be interested in Matthew Might's CRAPL (Community Research and Academic Programming License), available at http://matt.might.net/articles/crapl/. It contains this term: "You agree to hold the Author free from shame, embarrassment or ridicule for any hacks, kludges or leaps of faith found within the Program". It is not clear to me whether this "license" has any legal effect, but the intent is clear: Release your ugly code, and don't think badly of the ugly code of others.

Toxaris

Posted 2015-01-22T21:21:47.963

Reputation: 1 295

12

Tangentially related: I will address how to share the software given your concerns (not whether you should share it, which you already have an answer for).

Putting the failed attempts in version control effectively means that nobody will ever see them. The way I handle this is to put each attempt in a method, and each failed attempt in a separate method:

def main():
    x, y = 2, 3  # example inputs
    print(get_foobar(x, y))


def get_foobar(x, y):
    """
    The current, working attempt
    """
    return x**y


def get_foobar_legacy_1(x, y):
    """
    This attempt did not work for values > 100
    """
    return x + y


def get_foobar_legacy_2(x, y):
    """
    This attempt did not work on Wednesdays in September
    """
    return x - y


if __name__ == "__main__":
    main()

As per the comments below, it may be a good idea to put these methods in a separate FailedAttempts or BadIdeas class. This has the nice effect of compartmentalizing the various stages for the process as per actual need. I find that computer programmers often have a knack for when to break logic off into a method and when not to, but computer scientists often do not. This approach helps the computer scientists break off into a method when necessary.
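A rough sketch of the separate-class variant might look like the following. The class name `FailedAttempts` comes from the suggestion above; the method names are invented for illustration, and the failure reasons are the placeholder ones from the example.

```python
class FailedAttempts:
    """Approaches that were tried and abandoned.

    Kept for documentation only; nothing in the working code calls these.
    """

    @staticmethod
    def foobar_by_addition(x, y):
        """Did not work for values > 100."""
        return x + y

    @staticmethod
    def foobar_by_subtraction(x, y):
        """Did not work on Wednesdays in September."""
        return x - y


def get_foobar(x, y):
    """The approach actually used for the published results."""
    return x**y
```

Grouping the abandoned approaches this way makes it easy to move them into their own file later, and a quick search for `FailedAttempts.` shows whether any live code still depends on them.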

dotancohen

Posted 2015-01-22T21:21:47.963

Reputation: 221

That's not part of any best programming practice. Supposedly unused code should actually be commented out, lest some code keep calling get_foobar_legacy_43. And when it becomes clear it is broken, it should be removed if possible. If understanding some failed attempt is worthwhile for readers of the current version (which happens), you should put it in version control and add a comment pointing to the relevant commit ID — possibly with a permalink. – Blaisorblade – 2016-02-27T13:40:45.400

3@Blaisorblade: You are right, if the goal is to develop a well-functioning application then unused code should be removed, either via commenting or by relegating it to the depths of the source control software. However, that is not the goal stated by the OP. The OP needs to have his failures documented. This is the way to do that. Though I do see value in your point, and perhaps each method could be commented out with /* */ block comment syntax. Interestingly, one of the few languages that do not support block comments is Python, the language that I used for pseudo code above. – dotancohen – 2016-02-27T20:03:10.980

1@Blaisorblade: An even better solution may be to have a separate file, class, or directory which encompasses the failed attempts, separate from the main code of the application. – dotancohen – 2016-02-27T20:04:49.753

Documenting failures is not stated in the question, and I think it's a good idea in few cases (say, for interesting but failed attempts to achieve the paper's contributions). "Leaving in failures" seems to come from another answer—where people had a strong debate: http://academia.stackexchange.com/a/37373/8966.

– Blaisorblade – 2016-02-28T10:29:53.520

8

I think you should share it. First of all, you should do some basic cleanup (e.g. remove earlier code that is no longer used, remove commented-out code, comment in a consistent way, and so on). Moreover, if you put some "to do" notes in the code, others can see that you ran out of time and what your intentions were (e.g. todo: this should be changed to enum). I also think you should share the most important part of your algorithms. When I share code, I never share the unimportant parts. Everyone can handle reading/writing of files, communication between threads, GUI and so on. But don't share unreadable code. It would make no sense. So I think the middle way is best, as it so often is. :-)
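As an illustration of the "to do" note mentioned above, here is a small sketch of what such a marker and its eventual resolution might look like in Python. The constant and class names are hypothetical.

```python
from enum import Enum

# TODO: this should be changed to an enum (see SolverMode below).
MODE_FAST = 0
MODE_EXACT = 1


class SolverMode(Enum):
    """What the TODO asks for: named, self-documenting choices."""

    FAST = 0
    EXACT = 1


def describe(mode):
    """Accept a SolverMode and return a readable label."""
    return f"running in {mode.name.lower()} mode"
```

Even if the refactoring never happens, the TODO line tells a reader that the bare integers are a known shortcut rather than a deliberate design.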

David1199

Posted 2015-01-22T21:21:47.963

Reputation: 81

6Cleaning up is good in principle. However, if one waits for a good time to clean up, that time may never happen. I'd suggest putting it in a version control repository on GitHub or Bitbucket or similar right away, and cleaning it up as and when you get around to it. Anyone downloading it will mainly be looking at the HEAD, anyway. – Faheem Mitha – 2015-01-23T10:49:24.687

7

Of course you should share the source code.

Academically speaking, a software-based result using code that is not readily available is not very valuable, as how would other people be able to verify your claims, if needed? Do you expect them to program on their own for this purpose? Sharing binaries only is much less valuable, and often leads to nightmares for people trying to run them.

Dima Pasechnik

Posted 2015-01-22T21:21:47.963

Reputation: 551

7

Lots of points in favour of publishing the code have been named in the other answers, and I completely agree with them. Hence, as the basic desirability of publishing the code has been discussed, I would like to supplement this with a checklist of further points that need to be considered. Many of these issues probably appear in virtually all academic software, so even if you cannot respond "This does not apply to my project." to all of them, you should at least be able to respond "This is a concern, but we can deal with this issue by ..." before publishing your code:

  • Are you allowed to publish the code?
    • Can you guarantee you only used code fragments that you are allowed to redistribute? Or did you possibly use code from non-open sources that you may use for your own internal software, but that you are not allowed to publish? Can you guarantee all the code that you used is allowed to be published in one complete package? License compatibility is a non-trivial issue.
    • Can you even reliably find out? Did you outsource any parts of your coding work, or integrate unpublished code from elsewhere? For instance, did you supervise any students during their graduation theses or employ any student research assistants, whose work was based upon your research and thus their code was added to your codebase? Did any co-workers contribute code to your codebase? Did they get some of their code from students? Did all of these people involved properly pay attention to licensing issues (if at all they had the knowledge to make an educated judgement about these licensing questions)? Can it even still be determined where which parts of the code originated? Do the people who contributed each part still know? Are they even still "within contact range" for you?
    • Was the code developed during working time based on third-party funds? If so, do the funding contract terms allow the code to be published, or do they include any requirements that software created within the funded project must be shared exclusively with the project partners?
  • Do you have sufficient resources (time and otherwise) to spend the effort to clean up the code and its comments in a way that it is still meaningful, but does not provide any information that must not become public?
    • Do you have any comments giving away who worked on the code? Were the people who contributed code officially allowed to work on the respective research, as per their funding? (Software developers are well aware that teamwork and reuse of components are core aspects of software development. Funding agencies, unfortunately, are typically very unaware of this and assume that if developer A is funded from project X and developer B is funded from project Y, A works exclusively on X and B works exclusively on Y, and revealing that, w.l.o.g., A spent only half an hour doing something that ended up in project Y could lead to severe consequences, such as reclaiming parts of the funding.)
    • Does anything in the published data give away any information about the particularities of how the work was done that must not become public? This is especially important if the whole commit history in a VCS is going to become public (or, practically, means that the commit history should never be published), but may also play a role in other situations. For example: Was any work on the code done outside of the officially assigned working times (e.g. during weekends)? Do working times give away that you worked more than the legal limit of your country for working hours per day? Do working times give away that you did not adhere to legally required breaks? Do working times give away that people assigned to other projects made contributions? Do working times provide any reason to distrust any of the statements you made otherwise about your working times (e.g. in project success reports that required a detailed assignment of working times to pre-defined work packages with certain maximum allotments)? Does anything give away that you worked in situations where you should not have been working (e.g. during a project meeting)? Does anything give away that you worked in locations where you should not have worked (e.g. from home, when your contract does not allow you to do home office, e.g. for insurance-related complications)?
    • Is there any secret information in the code, such as passwords, user account names, or URLs that must not be publicly known (because the servers are not laid out to handle larger amounts of users beyond a small number of select people who were given the URL for the test setup personally)?
  • Is the code usable by anyone else?
    • Will the code run, or does it require extensive configuration efforts? Can you spend the effort required to explain what configuration is necessary?
    • Can the code be compiled? Have you used any unofficial modified or custom-built compilers that are not publicly accessible? If so, does the code add anything beyond what may already be provided as a pseudo-code algorithm in your papers?
    • Does the code require any external resources? Will the code only be useful if it can access servers, libraries, or datasets that you cannot publish along with the code for one reason or another? Can at least a description of these resources be provided, or are their interfaces subject to some kind of an NDA?
    • Does the code make any irreversible changes to systems it runs on? For example, does it automatically change any system configuration (such as overwriting the system search path)? Does it perform any low-level access to hardware components that could, in certain constellations (that you internally avoid in your test setups) cause permanent damage to any components? Can you reliably warn users of the code of such possible unwanted side-effects?
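The check for secret information can be partly automated. The following is only a rough sketch in Python: the patterns in `SUSPICIOUS`, the file suffixes, and the function names are assumptions chosen for illustration, and a scan like this will miss things, so it does not replace reading the code yourself.

```python
import re
from pathlib import Path

# Hypothetical patterns; extend them for your own project.
SUSPICIOUS = [
    # Assignments such as: password = "..."
    re.compile(
        r"(password|passwd|secret|api_key|token)\s*=\s*['\"][^'\"]+['\"]",
        re.IGNORECASE,
    ),
    # Any hard-coded URL, e.g. an internal test server.
    re.compile(r"https?://[^\s'\"]+"),
]


def find_suspicious_lines(text):
    """Return (line_number, line) pairs that match any pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS):
            hits.append((number, line.strip()))
    return hits


def scan_tree(root, suffixes=(".py", ".m", ".c", ".txt")):
    """Scan every file under root whose suffix is in suffixes."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_suspicious_lines(path.read_text(errors="replace"))
            if hits:
                report[str(path)] = hits
    return report
```

Calling `scan_tree(".")` from the top of the source tree returns a dictionary mapping file names to the flagged lines. If the commit history is going to be published as well, old revisions need the same check, since a password removed from the current version may still be in an earlier commit.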

O. R. Mapper

Posted 2015-01-22T21:21:47.963

Reputation: 15 415

1You consider funding agencies or employers sifting through commit logs to determine legal consequences. That's a clear theoretical concern. So, do you have any evidence of it ever happening? My limited experience with funding agencies, ERC grants in particular, is in fact the opposite, even though that doesn't count. – Blaisorblade – 2016-02-27T13:32:35.880

@Blaisorblade: "So, do you have any evidence of it ever happening?" - the motivation for the funder to discover possibilities to reduce their costs seems clear, and the possible repercussions that might be enforced (paying back some of the grant money) are sufficiently severe (losing previously granted money is probably one of the few things that can get uni employees into severe trouble with uni administration) that it seems reasonable not to open up this possible attack point in the first place. – O. R. Mapper – 2016-02-27T17:12:14.903

5

Of course. The only way you are going to get better at writing good software is to get feedback (all types). If you're afraid of feedback then you won't really get very far. The three basics to writing great software are practice, practice, and practice.

Now, as to the question of whether it would harm your career if people found out that your software writing skills aren't top notch: I think not. On the contrary, they would respect you for your academic integrity and would look forward to collaborating with you.

King

Posted 2015-01-22T21:21:47.963

Reputation: 151

5

Talk to some of the professors in your computer science department. See if any of them are looking for a project where students can clean up messy code to make it more presentable.

For the students who revise the code, this can be a good learning experience. What happens when coders program with a results-first mindset – or results only mindset? They get to see that first hand. They also get to apply some of those best practices they've been learning about. And they might be motivated to do an especially good job knowing that other professionals are already interested in seeing the code.

A professor might even make this into a contest, where teams of students all take a crack at revising the software, and the best result is shared with the rest of the world.

If their refactoring efforts flop, you're no further behind than you were. If that's the case, disclaimers are a wonderful thing. Simply share the code, but add a caveat: "It isn't pretty. When I wrote this, I was trying to get my research done – I wasn't thinking it would ever go outside my computer lab. But you're welcome to take a look if you really want to."

J.R.

Posted 2015-01-22T21:21:47.963

Reputation: 10 824

I like this, I would have loved to have had this chance when I was at Uni. Reading and understanding other people's code is a skill, and it has to be learned. – Contango – 2015-01-26T12:52:31.783

3

Yes, you should. After all, the Linux kernel source code is quite a mess, and that hasn't prevented many professional developers from studying it and contributing patches and additions to it. Remember also that the Linux kernel is the base of the operating system that runs the fastest and most powerful supercomputers and most devices in the world. P.S.: Linus Torvalds, the guy who created the Linux kernel, has a very profitable and successful career, which has not been affected negatively or harmed in any way by the fact that the Linux kernel source code is messy.

user28382

Posted 2015-01-22T21:21:47.963

Reputation: 31

3

A reason that no one has mentioned why you should share your code is that you might find someone who is interested in collaborating with you, but who is prepared to spend more time cleaning up the code and making it work on different systems, etc. than on doing the innovative development that you have done.

Lots of people find this kind of work very satisfying and if your code is genuinely useful to them they might be happy to do it. In any case, you might find that getting feedback from people who have tried to use it, but need some kind of help, is a good motivation for you to make it more maintainable/easier to use and understand.

jwg

Posted 2015-01-22T21:21:47.963

Reputation: 1 235

2

You could just push it to GitHub and try to maintain it as a project, so that other people who are interested in your work can access the code easily and perhaps help to improve it.

lovelyzlf

Posted 2015-01-22T21:21:47.963

Reputation: 56

0

You should definitely share your code.

To sort things out, group related parts of the code into regions; for example, make a region for a failed attempt and explain why it failed. Also, if you develop in Visual Studio, install the “CodeMaid” extension from the Extension Manager and clean your complete solution. It will remove spaces and also remove unused references, making most things look better.

If you develop in C# then share your code with me. I can also help you with sorting things out :)

Zuberi

Posted 2015-01-22T21:21:47.963

Reputation: 21

2@Zuberi005 is the only person to offer an automated solution to help clean up the code, and the only person to personally offer to help to clean up the code. And someone downvoted? Shame on them! – Contango – 2015-01-26T12:51:30.713

0

Share it if you want to, don't share it if you don't want to. I know this sounds snarky, but I think there is too much pressure nowadays to "share everything", and people will try to make you feel guilty for not sharing, but really you have no obligation to share anything.

CaptainCodeman

Posted 2015-01-22T21:21:47.963

Reputation: 1 093

2Reproducible results are one of the cornerstones of the scientific method. And that requires sharing. Your comment is akin to saying "... but really, Scientists have no obligation to adhere to the scientific method." – Contango – 2015-01-26T12:56:56.913

3Sure, sharing may be optional outside the scientific community, but it sure is not optional inside the scientific community. – Contango – 2015-01-26T12:58:05.353

1@Contango Yeah that's a fair point if releasing the software helps to reproduce the results. – CaptainCodeman – 2015-01-27T23:06:31.750

@JeffE I didn't share anything, what are you talking about? I find your message cryptic. If you wish to be understood, please be a bit more clear. – CaptainCodeman – 2015-01-29T22:27:34.977

You shared your opinion, of course. – JeffE – 2015-01-29T22:38:42.757

0

Put up a disclaimer that the code is provided "as is" with no promises of support, etc. And then share the code.

Case study: Turning a cloud of isolated points into a watertight surface is an extremely important practical problem, used everywhere from robotics to computer vision to processing data from 3D sensors like the Microsoft Kinect.

Poisson surface reconstruction is 7 years old and has long stopped being the state of the art for solving this problem. But everybody still uses it to this day. Why? Because the author released the code and it has since been incorporated into a bunch of popular geometry processing libraries. The paper now has over a thousand citations.

user168715

Posted 2015-01-22T21:21:47.963

Reputation: 3 278

0

Yes. You should release your code, probably under the CRAPL license. The goal is to build a better future - and your lousy code will help people do that. A caveat is that you should document how to successfully operate the code well enough for someone to have a decent chance of reproducing any published results.

And, don't worry - one bit of research code I worked on had been developed by 5 postdocs of indifferent programming ability for a series of projects over the course of about 8 years.

The list of global variables (just the names) was roughly 4 pages.

Roughly one third of them were used to set default behavior to change the functionality that functioned at a given moment. Another 20% were parallel data structures - meaning that they stored approximately the same data - and therefore functions in the code pulled from the data structures more or less at random. Yes. They were sometimes out of sync. And sometimes needed to be out of sync.

There were roughly 50 undocumented versions, stored in random portions of the group's server - each of which served at least one specific purpose - and only one admin kept those specific purposes in his head. It was more common than not to have people using the 'wrong' version for a given purpose.

The use of incredibly complex recursive procedures to, e.g., write a file, was standard. Seriously - a few thousand lines to save image results.

Oh, and the remains of a butchered attempt to solve a memory leak (actually an invisible figure) by never creating a new variable.

Oh, and the database, that lovely database. About half of the data was unusable owing to (a) database design errors (b) data entry errors (in automatic programs). The code to retrieve files from the database was several hundred lines of logic long... The database itself was also kind enough to contain many copies of the same data, much with broken links between tables. Constraints? No. I watched a statistician proceed from disquiet to fear to tears to quitting within a month of being entrusted with the database...

There were somewhere between 0 and 1 ways to operate the software and retrieve correct results at any given instant...

And yes, there were gotos.

Oh, and in an effort to ensure opaque and nondeterministic operation, a series of computations was performed by calling GUI buttons with associated callbacks.

Approximately 90% of any given function was, quite reliably, not relevant to the result or to debugging of the result - being composed, rather, of short-term projects inserted and then never removed. Seriously - I wrote a feature complete version that actually worked that was 1/10th the size... Significant fractions were copy-pasted inserted functions, many of which differed from each other.

And, no Virginia, there is no documentation. Or descriptive variable names.

Oh, and the undocumented, buggy DLLs and associated libraries - generated using code that no longer existed.

All written in Matlab. In terms of Matlab coding practices, assume that copious use of eval would be the highlight of your day.

Seriously, your code isn't so bad.

That said, if you've done something actually useful, it might be career-enhancing to release a cleaned-up version so that other people will use and cite your library. If you've just done something, then reproduction is probably as far as you'd be well-advised to go.

erwin

Posted 2015-01-22T21:21:47.963

Reputation: 546