I have $k = |K|$ arms, where each arm $\alpha \in K$ has an unknown reward distribution $\nu_\alpha$ over $[0,1]$ with unknown mean $\mu_\alpha \in [0,1]$.
At each time $t$ an action $A_t \in K$ is chosen based on previous observations and actions.
The reward signal is a binary value $X_t$ with expectation $E(X_t \mid A_t) = f(\mu_{A_t} + \epsilon_t)$, where $\epsilon_t$ is an unknown time series that is independent of $A_t$, not centered, not bounded, and not of known periodicity. Furthermore, $f$ is monotonically increasing.
I am looking to choose $A_t$ to maximize $X_t$, with knowledge of all past values. If multiple arms have the same mean $\mu_{A_t}$, they should all continue to be chosen with equal frequency as $t$ tends to infinity.
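To make the setting concrete, here is a minimal simulation of the reward process described above. The logistic choice of $f$, the random-walk $\epsilon_t$, the uniformly drawn means, and the random placeholder policy are all illustrative assumptions on my side, not part of the actual problem.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                  # number of arms (illustrative)
mu = rng.uniform(0.0, 1.0, size=K)     # unknown arm means mu_alpha in [0, 1]

def f(x):
    # f is only known to be monotonically increasing; a logistic squashing
    # into (0, 1) is used here purely for illustration.
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))

# eps_t: unknown, non-centered, unbounded, aperiodic disturbance that is
# independent of the chosen arm; a drifting random walk is just one
# hypothetical example of such a process.
eps = 0.3
T = 1000
for t in range(T):
    eps += rng.normal(0.01, 0.05)      # disturbance evolves independently of A_t
    A_t = rng.integers(K)              # placeholder policy: pick an arm at random
    p = f(mu[A_t] + eps)               # success probability E[X_t | A_t]
    X_t = rng.binomial(1, p)           # binary reward actually observed
    # a bandit algorithm would update its statistics for arm A_t here
```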
I am unsure what type of multi-armed bandit algorithm is right for this problem.
I had a look at Multi-armed bandit with seasonality and a few other places.
Linear bandits don't work due to the unknown $f$, the unknown distributions, and the binary rewards. "Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards" is too powerful, as it deals with a change of the dominant arm(s) over time, which is not the case here. "On Slowly-varying Non-stationary Bandits" imposes a drift limit, which would amount to a bound on $\epsilon_t$ that I do not have.
An approach should not try to approximate $f$.
I am trying to use this in a setting where I am not sure whether the effect size of the actions is anywhere near comparable to the underlying trend, and I only observe a binary outcome. If a different model is more suitable for this, I am open to considering it.