Background: I'm a scientist working with a small team on data analytics. No one on our team has any experience with software engineering, though I'm trying to change that.

Suppose I have an array A that constitutes some measurements. I have a script that transforms A-->B. I also have 3 separate scripts that transform array B-->C, B-->D, and B-->E. Finally, I have a script that transforms array E-->F. Let's say each script takes 1 hour to run and that each of the arrays A--F is valuable by itself. How would I go about creating a data file / database that ensures I'm always working with the latest versions of these arrays?

I've thought of two hackish solutions and I think both are inadequate.

  1. One solution would be to write a single pipeline that computes everything and adds it to my data file/database under a single version number. This would ensure that, at one point in time, everything is up to date. However, it feels inadequate because if I modify the code that computes D from B, I would have to recompute everything. This couples steps that aren't inherently coupled (i.e. I shouldn't have to recompute A-->B, B-->C, B-->E, or E-->F in this example), and what should be a 1-hour fix now takes 5 hours!
  2. Another option is to treat each array's creation as its own pipeline. That way I could update the code that creates one array without touching the others. However, this also feels inadequate: if I modify the code that turns B-->E, I should also rerun the code that turns E-->F... but what if I forget? Worse, what if a coworker modifies the code that takes B-->E and doesn't rerun E-->F? I could be stuck working with out-of-date data and not even realize it. I've thought about attaching a version number to each array, but it seems extremely easy for the arrays in even this simple example to get out of sync (see the sketch after this list).
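
For concreteness, here's how I picture the dependencies; a minimal Python sketch (script and file names are made up) that computes which arrays go stale when one script or array changes:

    # Encode the pipeline as a dependency graph so that "what is stale
    # after a change?" is computed rather than remembered.
    # Script names are hypothetical placeholders.
    PIPELINE = {
        # output: (input array, script that produces it)
        "B": ("A", "make_B.py"),
        "C": ("B", "make_C.py"),
        "D": ("B", "make_D.py"),
        "E": ("B", "make_E.py"),
        "F": ("E", "make_F.py"),
    }

    def downstream_of(changed):
        """Arrays whose values depend (directly or indirectly) on `changed`,
        which may be an array name or a script name."""
        stale = set()
        grew = True
        while grew:
            grew = False
            for out, (inp, script) in PIPELINE.items():
                if out not in stale and (changed in (inp, script) or inp in stale):
                    stale.add(out)
                    grew = True
        return stale

    print(downstream_of("make_E.py"))  # {'E', 'F'}: only E and F need rerunning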

How do the data teams at larger companies solve this sort of issue?

  • I think you need much more than just a database here: some kind of system that oversees all of this, i.e. which version of the data is in the database, which script is scheduled to run or is already running, which data is expected to change as a result, and triggers on changes in the data or the scripts. If you value working with non-stale data, creating such a system should be worth the time and effort. I also feel it is a prerequisite to your ability to scale the team beyond just a few people. Commented Mar 6, 2023 at 6:47
  • Does each script take 1 hour to run due to computational complexity? Or because they are implemented in a non-compiled scripting language and written by scientists with little knowledge of, or interest in, software optimization? Getting a factor 10-100 speedup is not that rare for unoptimized code, and it may allow you to just recompute everything from the original data when needed. Commented Mar 6, 2023 at 7:54
  • @JonasH I'd say they are reasonably optimized. I used 1 hour in the description to simplify the problem, but really E-->F (the longest-running of the pipelines) takes several hours. That script is written in FORTRAN and I believe it's fairly optimized. I certainly know the type of scientist who writes highly unoptimized code. Commented Mar 8, 2023 at 19:11
  • There is a potential contradiction here. If I've just updated A, and 5 minutes later you try to get F, you cannot get a version of F that accounts for my update. Can you clarify whether "the latest" refers to the latest individual set (i.e. where A shows updates that F does not yet) or to the latest complete set (where A through F are from the same generation)? Commented Mar 8, 2023 at 22:48

2 Answers

I don't know how the data teams at larger companies do it, but I would personally try not to overthink it.

If your data changes (new versions of A) are generally more frequent than the code changes in your scripts, don't bother rerunning the pipeline on script fixes; simply communicate to the data consumers that a bug has been found and that the data produced by the buggy script should be considered questionable until the next regular pipeline run.

If the frequency of bug fixes is higher, you seriously need to work on your QA, not micro-optimize the time to run the pipeline.

(Added) If you really need the described functionality, look at the dependency mechanisms used by Make and continuous integration systems such as Jenkins. Those are designed to perform minimal updates when something changes.
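
To give a feel for what those dependency mechanisms do, here is a rough Python sketch of the Make-style rule "an output is stale if it is missing or older than any of its inputs, including the script that produced it" (file names are hypothetical):

    # Make-style staleness check based on modification times.
    # Assumes the input files exist; file names are hypothetical.
    from pathlib import Path

    def is_stale(output, inputs):
        out = Path(output)
        if not out.exists():
            return True
        out_mtime = out.stat().st_mtime
        return any(Path(dep).stat().st_mtime > out_mtime for dep in inputs)

    if is_stale("F.npy", ["E.npy", "make_F.py"]):
        print("F is out of date; rerun make_F.py")

Make itself applies exactly this timestamp rule, so for file-based pipelines a plain Makefile may already be enough.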

  • I've been wondering if I'm overthinking it, so it's nice to hear someone suggest that. For what it's worth, we don't know of any bugs when creating new versions of A; however, we're working with data collected from an instrument, and as we learn new ways to calibrate the data, the pipeline code that creates A changes as well. In practice, updating the calibration is just like fixing bugs, but I suppose it feels different because the reason is our lack of understanding rather than QA failures. Commented Mar 8, 2023 at 19:24

My preferred approach would be to see if the runtimes can be reduced. If each step could be cut from 1 hour to, say, 10 minutes, you could redo all the processing in the time one step takes now. Such an improvement would be fairly realistic when moving from something like Matlab to a compiled language. You might consider consulting an experienced software developer to help you port the code.

However, if most of your runtime is inside library functions, the possible gains will be much smaller, and/or your team might not want to learn a new language. In that case it might be useful to treat each intermediate array as a cache of the original data.

When you transform A -> B, you could also store the hash of A as well as a hash of the software used to do the transformation; this way you can check whether the B data is up to date just by recomputing the hashes. Do the same for all the other transformations. Then write some software to run these checks on each data set in order, and optionally recompute the stale ones.
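
A minimal sketch of that idea (the file names and the sidecar metadata format are assumptions, not a prescription):

    # Store, next to each derived array, the hashes of its input and of the
    # script that produced it; the array is current only if those match.
    # File names and the .meta.json sidecar format are illustrative only.
    import hashlib, json
    from pathlib import Path

    def file_hash(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def record(output, input_, script):
        meta = {"input": file_hash(input_), "script": file_hash(script)}
        Path(output + ".meta.json").write_text(json.dumps(meta))

    def is_current(output, input_, script):
        meta_path = Path(output + ".meta.json")
        if not meta_path.exists():
            return False
        meta = json.loads(meta_path.read_text())
        return meta == {"input": file_hash(input_), "script": file_hash(script)}

    # After computing B from A:  record("B.npy", "A.npy", "make_B.py")
    # Before trusting B:         check is_current("B.npy", "A.npy", "make_B.py")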

As an alternative to hashes, you could maintain version numbers explicitly and use them in much the same way. But if the version numbers have to be updated by hand, it is likely that someone will forget to do so, and that would break the whole versioning scheme. Hashes avoid this problem, at some computational cost. Or you might be able to use software like git as a source of version numbers.
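
If the scripts live in git, the last commit that touched a script can serve as its version number; a small sketch (the script path is a hypothetical placeholder):

    # Use the last git commit that touched a script as its version identifier.
    import subprocess

    def script_version(script_path):
        result = subprocess.run(
            ["git", "log", "-n", "1", "--pretty=format:%H", "--", script_path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(script_version("make_E.py"))  # e.g. 'a3f9c2...'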