Revision c021471c-3e8d-4ac0-bc72-61dc96f775cb - Software Engineering Stack Exchange

I have a data processing pipeline with well defined stages and IO boundaries. I can choose a language to suit the needs of this design. It starts with an `InputObject`. At the end of each stage, there is some additional data derived from some or all of the results of previous steps and the `InputObject`. That is, `StageN` extends `Stage` and produces an immutable `ResultsObjN`, which extends `Result`. After the pipeline is complete, the final output I'm interested in is some subset of all results from each step:

 InputObject
 |
 |
 v
 -----------
 | Stage 1 | <-- Tunable Parameters
 -----------
 |
 |
 InputObject
 ResultsObj1
 |
 |
 v
 -----------
 | Stage 2 | <-- Tunable Parameters
 -----------
 |
 |
 InputObject
 ResultsObj1
 ResultsObj2
 |
 |
 v
 ...

Currently, I'm modeling each stage as a `Stage` object. I'll explain what I mean by "tunable" parameters soon.

- Each `Stage` has each tunable parameters as attributes exposed by getters and setters.
- Each `Stage` is constructed with its input dependencies.
- Each `Stage` has a method to compute its result based on the tunable parameters

Here it is in almost-UML:

 --------------------------------------------------
 | <<abstract>> |
 | Stage |
 |--------------------------------------------------|
 |--------------------------------------------------|
 | +computeResult() : Result |
 --------------------------------------------------

For instance, `Stage3` computes `ResultObj3` using the `InputObject` and the results of `Stage2`. There are 2 parameters that can be set to change the results.

 --------------------------------------------------
 | Stage3 |
 |--------------------------------------------------|
 | -param1 : Int |
 | -param2 : Float |
 |--------------------------------------------------|
 | +Stage3(raw : InputObject, corners : ResultObj2) |
 | +get/setParam1() |
 | +get/setParam2() |
 | +computeResult() : ResultObj3 |
 --------------------------------------------------

I would like to reuse this processing pipeline pattern, with different stages in different quantities with different input dependencies. The tunable parameters may be tweaked by an automated optimizer or by a human using some UI. In either case, they follow the same feedback process. Here it is tuning stage 3:

 ResultObj3 performStage(InputObject raw, ResultObj2 corners):
 
 1. Stage processor = new Stage3(raw, corners)
 2. ResultObj result = processor.computeResult()
 3. while (notAcceptable(result)):
 4. tuneParams(processor, result) // tune to "more acceptable" params 
 5. result = processor.computeResult()
 6. return result

I feel like the tuning process should be performed for each step by a queue executor, but I'm not yet sure how to accommodate the growing set of differently typed results , or the varying construction requirements of each stage.

Here is some sample code of such a pipeline being constructed. The classes `PointExtractor`, `PointClusterer`, and `ClusterPlotter` all inherit from `Stage`. The `StageTuner` takes the class of the `Stage` to tune and implements the above feedback loop.

 ...
 QueuedPipeline pipeline = new QueuedPipeline()
 
 pipeline.setInput(raw) <-- raw is an InputObject taken as input elsewhere
 
 // addStage() takes the name of the stage, the stage runner, and the names of the stages it needs results from
 pipeline.addStage("ExtractPoints", new StageRunner(PointExtractor, ), [])
 pipeline.addStage("ClusterPoints", new StageRunner(PointClusterer), ["ExtractPoints"])
 pipeline.addStage("PlotClusters", new StageRunner(ClusterPlotter), ["ExtractPoints", "ClusterPoints"])
 
 // tune and execute each stage with the StageRunners
 PipelineResultList results = pipeline.run()

 Result someResult = results.getResult(1)
 Result anotherResult = results.getResult(4)
 
 saveResultFile(someResult, anotherResult)

Simply passing around classes for construction is a headache in some languages. Also, I'm not really sure how to generically construct and set parameters of each `Stage` subclass because of the different number/type of parameters - perhaps each `Stage` subclass needs a corresponding `StageRunner` subclass. **...but these are just symptoms of my real problem: This is all starting to get complex and fuzzy for what I expected to be a straightforward problem.**

Am I defining my objects and behaviours badly? Or is this not as straightforward a problem as I thought? Something else entirely?