We are building an LLM-based application that pulls a large amount of user data from various internal sources. The data is then sent through various prompts, whose answers are used to fill out forms correctly.
The prompts are stored in the codebase itself, so that a prompt chain can be updated rapidly; a change usually requires touching both the prompts and the code.
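For concreteness, a minimal sketch of the setup described above; all names here are hypothetical, not our actual code. Prompts live in the repo as plain templates, and a "chain" is just an ordered list of them:

```python
# Hypothetical prompt templates stored directly in the codebase.
EXTRACT_PROMPT = (
    "Extract the applicant's name and address from the following records:\n"
    "{user_data}"
)

FILL_FORM_PROMPT = (
    "Using these extracted fields, answer the questions on form {form_id}:\n"
    "{fields}"
)

# A "prompt chain" is simply an ordered list of templates.
FORM_FILL_CHAIN = [EXTRACT_PROMPT, FILL_FORM_PROMPT]


class _SafeDict(dict):
    """Leave unknown placeholders untouched so each step in the chain
    only consumes the fields it knows about."""
    def __missing__(self, key):
        return "{" + key + "}"


def render_chain(chain, **fields):
    """Substitute the given fields into every template in the chain."""
    return [template.format_map(_SafeDict(**fields)) for template in chain]
```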
How should metrics on the prompts be managed and matched back to the original prompt versions in git?
These include:
- evaluation pipeline metrics,
- feedback from users,
- metrics on individual prompts,
- metrics on the whole prompt chain,
- and so on.
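The kind of linkage we have in mind is stamping each metric record with the version of the prompt that produced it. A rough sketch (hypothetical names; the per-prompt hash mirrors git's blob hashing so a single prompt can be identified even when other files in the same commit changed):

```python
import hashlib
import subprocess


def current_commit() -> str:
    """Commit SHA of the checked-out code (and hence the prompts).
    Falls back to "unknown" outside a git checkout."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def prompt_blob_id(prompt_text: str) -> str:
    """Content hash of a single prompt, computed the way git hashes a
    blob (sha1 over b"blob <size>\\0" + content), so one prompt can be
    matched back to its git object directly."""
    data = prompt_text.encode()
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


def metric_record(prompt_text: str, metric: str, value: float) -> dict:
    """One metric row that can be joined back to git later."""
    return {
        "metric": metric,
        "value": value,
        "commit": current_commit(),                  # whole-chain version
        "prompt_blob": prompt_blob_id(prompt_text),  # single-prompt version
    }
```

Records like this could be stored in any metrics database and later joined against git history by commit SHA or by blob id.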
Also, how can the prompts, either a whole chain or a single prompt within a chain, be retrieved from git efficiently for rollback or re-evaluation?