
I have $k = |K|$ arms with unknown reward distributions $\nu_\alpha$ over $[0,1]$ and unknown means $\mu_\alpha \in [0,1]$, where $\alpha \in K$.

The action $A_t \in K$ is chosen at time $t$ based on previous observations and actions.

The reward signal is a binary value $X_t$ whose expectation is $E(X_t) = f(\mu_{A_t} + \epsilon_t)$, where $\epsilon_t$ is an unknown time series which is independent of $A_t$, is not centered, is not bounded, and has no known periodicity. Furthermore, $f$ is monotonically increasing.
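To make the reward model concrete, here is a minimal Python sketch of the environment as described above. The particular choices of $f$ (a logistic squashing) and $\epsilon_t$ (a non-centered random walk) are purely illustrative assumptions, since both are unknown in my setting:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                # number of arms
T = 10_000                           # horizon
mu = rng.uniform(0.0, 1.0, size=K)   # unknown arm means in [0, 1]

# Illustrative confounder: an unbounded, non-centered random walk.
# The real epsilon_t has no known bound, centering, or periodicity.
eps = np.cumsum(rng.normal(loc=0.05, scale=0.2, size=T))


def f(x):
    # Illustrative monotonically increasing link mapping into [0, 1];
    # the true f is unknown (and should not be approximated).
    return 1.0 / (1.0 + np.exp(-x))


def pull(arm, t):
    # Binary reward X_t with E[X_t] = f(mu[arm] + eps[t]).
    return rng.binomial(1, f(mu[arm] + eps[t]))
```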

I am looking to choose $A_t$ to maximize $X_t$, using knowledge of all past observations and actions. If multiple arms share the same mean $\mu_{A_t}$, they should all continue to be chosen with equal frequency as $t$ tends to infinity (a sketch of how this could be checked follows below).
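To make that tie-breaking requirement checkable, here is a small helper on top of the simulator sketch above (reusing `np`, `K`, `T`, and `pull`; `choose_arm` is a placeholder for whatever policy is under evaluation):

```python
def selection_frequencies(choose_arm, horizon=T):
    # Run a policy for `horizon` steps and report how often each arm was
    # chosen; arms sharing the same mu should approach equal frequencies.
    counts = np.zeros(K, dtype=int)
    for t in range(horizon):
        a = choose_arm(t)
        counts[a] += 1
        _ = pull(a, t)  # in a real run the reward is fed back to the policy
    return counts / horizon


# Example: a uniformly random policy trivially satisfies the requirement.
print(selection_frequencies(lambda t: rng.integers(K)))
```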

I am unsure what type of multi-armed bandit algorithm is right for this problem.

I had a look at multi-armed bandit with seasonality and a few other sources.

Linear bandits don't work due to the unknown $f$, the unknown distributions, and the binary rewards. "Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards" is too powerful, as it deals with a change of the dominant arm(s) over time, which is not the case here. "On Slowly-varying Non-stationary Bandits" imposes a drift limit, which would be a bound on $\epsilon_t$ that I do not have.

An approach should not try to approximate $f$.

I am trying to use this in a setting where I am not sure whether the effect size of the actions is anywhere near comparable to the underlying trend, and I only observe a binary outcome. If a different model is more suitable for this, I am open to considering it.
