Background: I'm a scientist working with a small team on data analytics. No one on our team has any experience with software engineering, though I'm trying to change that.
Suppose I have an array A that contains some measurements. I have a script that transforms A-->B. I also have 3 separate scripts that transform array B-->C, B-->D, and B-->E. Finally, I have a script that transforms array E-->F. Let's say each script takes 1 hour to run and that each of the arrays A--F is valuable on its own. How would I go about creating a data file / database that ensures I'm always working with the latest versions of these arrays?
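For concreteness, here is roughly how I picture the dependency structure (the function names are just placeholders, not my actual code):

```python
# Placeholder sketch of the dependency structure described above.
# Each function stands in for one of the ~1-hour scripts.

def compute_B(A): ...   # A --> B
def compute_C(B): ...   # B --> C
def compute_D(B): ...   # B --> D
def compute_E(B): ...   # B --> E
def compute_F(E): ...   # E --> F

# The same dependencies written out explicitly:
DEPENDENCIES = {
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["B"],
    "F": ["E"],
}
```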
I've thought of two hackish solutions, and I think both are inadequate.
- One solution would be to write a single pipeline that computes everything and adds it to my data file/database under one version number. This would guarantee that, at any given point in time, everything is mutually up to date. However, it feels inadequate because if I modify the code that computes D from B, I would have to recompute everything. This couples things that aren't inherently coupled (i.e. I shouldn't have to rerun A-->B, B-->C, B-->E, or E-->F in this example), and what should be a 1-hour fix now takes 5 hours!
- Another option is to treat each array's creation as its own pipeline. That way, I could update the code that creates one array without touching the others. However, this also feels inadequate: if I modify the code that turns B-->E, I should also rerun the code that turns E-->F... but what if I forget? More insidiously, what if a coworker modifies the code that takes B-->E and doesn't rerun E-->F? I could be stuck working with out-of-date data and not even realize it. I've thought about attaching a version number to each array (a rough sketch of what I mean is below), but it seems extremely easy for the arrays, even in my simple example, to get out of sync.
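To illustrate that second option, here is the kind of thing I have in mind, with a version number stored next to each array. Every name here (the file names, the `save`/`load` helpers, the `make_*` functions) is hypothetical:

```python
# Rough sketch of option 2: each array is produced by its own script and is
# saved alongside a version number. All names are hypothetical.
import numpy as np

def save(name, array, version):
    # Store the array and its version together, e.g. as an .npz file.
    np.savez(f"{name}.npz", data=array, version=version)

def load(name):
    with np.load(f"{name}.npz") as f:
        return f["data"], int(f["version"])

# --- make_E.py: its own pipeline ---
def make_E():
    B, b_version = load("B")
    E = B * 2                          # stand-in for the real 1-hour computation
    save("E", E, version=b_version)    # record which B this E came from

# --- make_F.py: a separate pipeline ---
def make_F():
    E, e_version = load("E")
    F = E + 1                          # stand-in for the real computation
    save("F", F, version=e_version)

# Nothing here forces make_F() to run after make_E(); if someone reruns
# make_E() and forgets make_F(), F silently keeps referring to a stale E.
```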
How do the data teams at larger companies solve this sort of issue?