It depends on the research area, but I think a sometimes understated requirement is being confident about what produced each piece of intermediate data. Especially in biology, you can take TBs of raw data and derive different intermediate datasets (which are then analyzed and produce yet more intermediates). At some point you have to be able to trace from the raw data all the way through to the end result, and that gets tricky as scripts change. I found git annex huge for this, since it lets me treat data like code, even when it's terabytes. In other areas it's clearer that there's one main source of data, and analysis goes straight from it to interpretable results.
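
To give a rough idea of what I mean by tracing intermediates back to raw data, here's a minimal sketch in plain Python. It is not git-annex itself, just the underlying provenance idea: checksum each intermediate, note the inputs and the script commit that produced it, and write that next to the output. The file names, manifest format, and helper names here are all made up for illustration.

    # Hypothetical provenance sketch: hash each input and record the producing
    # script's git commit, writing a small manifest next to every intermediate.
    # Illustrates the "trace raw data -> result" idea; it is not git-annex's API.
    import hashlib
    import json
    import subprocess
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Checksum a (possibly large) file in chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(output: Path, inputs: list[Path], script: Path) -> None:
        """Record what produced `output`: input checksums plus the script's commit."""
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        ).stdout.strip()
        manifest = {
            "output": {"path": str(output), "sha256": sha256(output)},
            "inputs": [{"path": str(p), "sha256": sha256(p)} for p in inputs],
            "script": {"path": str(script), "commit": commit},
        }
        Path(str(output) + ".provenance.json").write_text(json.dumps(manifest, indent=2))

    # Example: after a filtering step writes reads.filtered.bam, record its lineage.
    # write_manifest(Path("reads.filtered.bam"), [Path("reads.raw.fastq")], Path("filter.py"))

What git annex adds on top of something like this is the storage side: the big files live outside the git object store (as annexed content), so you can version and sync terabytes of intermediates alongside the scripts without bloating the repo.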